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Abstract 

Cluster  analysis  is  widely  used  in  many  applications,  ranging  from  image  and  speech 
coding  to  pattern  recognition.  A  new  method  that  uses  the  weighted  Mahalanobis  distance 
(WMD)  via  the  covariance  matrix  of  the  individual  clusters  as  the  basis  for  grouping  is 
presented  in  this  thesis.  In  this  algorithm,  the  Mahalanobis  distance  is  used  as  a  measure 
of  similarity  between  the  samples  in  each  cluster.  This  thesis  discusses  some  difficulties  as¬ 
sociated  with  using  the  Mahalanobis  distance  in  clustering.  The  proposed  method  provides 
solutions  to  these  problems.  The  new  algorithm  is  an  approximation  to  the  well-known 
expectation  maximization  (EM)  procedure  used  to  find  the  maximum  likelihood  estimates 
in  a  Gaussian  mixture  model.  Unlike  the  EM  procedure,  WMD  eliminates  the  requirement 
of  having  initial  parameters  such  as  the  cluster  means  and  variances  as  it  starts  from  the 
raw  data  set.  Properties  of  the  new  clustering  method  are  presented  by  examining  the 
clustering  quality  for  codebooks  designed  with  the  proposed  method  and  competing  meth¬ 
ods  on  a  variety  of  data  sets.  The  competing  methods  are  the  Linde-Buzo-Gray  (LBG) 
algorithm  and  the  Fuzzy  c-means  (FCM)  algorithm,  both  of  them  use  the  Euclidean  dis¬ 
tance.  The  neural  network  for  hyperellipsoidal  clustering  (HEC)  that  uses  the  Mahalnobis 
distance  is  also  studied  and  compared  to  the  WMD  method  and  the  other  techniques  as 
well.  The  new  method  provides  better  results  than  the  competing  methods.  Thus,  this 
method  becomes  another  useful  tool  for  use  in  clustering. 
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HYPER-ELLIPSOIDAL  CLUSTERING 

I.  Introduction 

1.1  Background 

One  of  the  most  primitive  and  common  activities  of  man  consists  of  sorting  like 
things  into  categories.  The  objective  is  to  group  either  the  data  units  (tanks,  persons, 
images)  or  the  variables  (attributes,  characteristics,  measurements)  into  clusters  such  that 
the  elements  within  a  cluster  have  a  high  degree  of  “natural  association”  while  the  clusters 
are  “relatively  distinct”  from  one  another. 

Cluster  analysis  is  very  important  in  solving  pattern  recognition  problems,  especially 
when  there  are  only  unlabeled  data  points.  Lacking  the  knowledge  of  the  class  membership 
of  the  data  points,  one  cannot  estimate  the  statistical  properties  of  each  class.  Cluster 
analysis  is  also  useful  when  the  actual  distribution  of  the  sample  is  multimodal  or  can  be 
characterized  as  a  mixture  of  several  unimodal  distributions. 

Many  applications  of  pattern  recognition  and  neural  networks  require  clustering  as  a 
tool  of  learning  and  designing  a  pattern  classifier  based  on  the  minimum  distance  concept. 
Unsupervised  or  self-organized  learning  algorithms  have  played  a  central  part  in  mod¬ 
els  of  neural  computation.  Examples  are  Hebbian  learning,  ART,  and  Kohonen  feature 
maps  [9].  Unsupervised  models  have  been  proposed  as  front  ends  for  supervised  learn¬ 
ing  problems  [29].  Different  approaches  to  clustering  appear  in  Gersho  and  Gray  [15], 
Bezdek  [4],  and  Duda  and  Hart  [12].  In  spite  of  the  numerous  research  efforts,  data  clus¬ 
tering  remains  a  difficult  and  essentially  an  unsolved  problem  [18]. 

Clustering  can  be  hard,  where  each  data  point  is  assigned  to  only  one  cluster,  or  soft, 
where  clustering  is  basically  a  function  assigning  to  each  sample  a  degree  of  membership 
u  G  [0, 1]  to  each  cluster  [7].  An  example  of  the  hard  clustering  is  the  algorithm  proposed 
by  Linde,  Buzo,  and  Gray,  commonly  known  as  the  LBG  algorithm  [24].  On  the  other 


hand,  the  most  common  algorithm  in  fuzzy  clustering  is  the  fuzzy  c-means  (FCM)  outlined 
in  Bezdek’s  paper  [7], 

One  of  the  most  important  issues  pertaining  to  clustering  is  the  choice  of  the  similarity 
measure.  A  form  of  the  geometrical  distance  metric  between  points  in  the  feature  space  is 
considered  a  logical  choice.  A  general  form  of  the  distance  between  vectors  x  and  m  is 

D  =  \\x  —  m\\\  =  (x  —  mY A~^{x  -  m),  (1) 

where  d  is  the  dimensionality  of  the  feature  vector  and  A  is  any  positive  definite  d  x  d 
matrix  [24].  The  effect  of  A  is  to  scale  the  distance  along  each  feature  axis.  One  of  the 
most  commonly  used  distance  measures  is  the  Euclidean  distance,  De  =  jjx  —  Tn\\j,  where 
I  is  the  identity  matrix,  and  where  equal  weight  is  given  to  distance  along  each  feature 
axis. 

Partitional  clustering  algorithms  and  competitive  neural  networks  for  unsupervised 
learning  which  use  the  Euclidean  distance  metric  to  compute  the  distance  between  a  pattern 
and  the  assigned  cluster  center  are  suitable  primarily  for  detecting  hyperspherically-shaped 
clusters  [19].  Hence,  clustering  algorithms  with  the  Euclidean  distance  metric  have  the 
undesirable  property  of  splitting  large  or  elongated  (i.e.,  non-hyperspherical)  clusters. 

Other  distance  metrics,  such  as  the  Mahalanobis  distance  (MD),  can  also  be  used. 
The  squared  Mahalanobis  distance  between  a  pattern  x  and  the  cluster  center  m,  is  : 

DM^\\x-mi\\l-{x-mi)'^Cr\x-mi),  (2) 

where  Q  is  the  covariance  matrix  of  the  samples  in  the  cluster  [15].  Using  the  class- 
based  covariance  matrix  normalizes  distance  along  each  axis  of  the  d-dimensional  space 
relative  to  the  variance  in  each  dimension  for  that  class. 

The  use  of  the  Mahalanobis  distance  in  clustering  has  its  own  undesirable  effects. 
It  was  noticed  that  its  use  sometimes  causes  a  big  cluster  to  absorb  nearby  smaller  clus¬ 
ters  [35],  [25].  In  addition  to  this  problem,  other  difficulties  need  to  be  overcome  before  a 
clustering  algorithm  with  the  Mahalanobis  distance  can  be  beneficial. 
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1.2  Problem  Statement 


This  thesis  integrates  the  class-based  Mahalanobis  distance  as  a  similarity  measure 
in  partitional  clustering. 

1.3  Summary  of  Current  Knowledge 

Even  though  the  Mahalanobis  distance  is  well  known  in  the  pattern  recognition 
literature,  the  idea  of  measuring  the  Mahalanobis  distance  to  each  cluster  based  on  its 
covariance  matrix  and  using  these  measurements  as  a  basis  for  partitioning  is  not  common. 

Patrick  and  Shen  [27]  suggested  an  interactive  approach  for  hyperellipsoidal  cluster¬ 
ing.  They  used  the  problem  knowledge  to  provide  the  computer  with  the  initial  center 
of  the  hyperellipsoidal  clusters  and  the  covariance  matrix  for  the  data  in  that  cluster  in 
addition  to  the  “expert’s”  confidence  level  in  those  values.  Their  approach  sequentially 
detects  clusters,  so  there  is  no  competition  between  different  clusters,  and  the  output  clus¬ 
ters  are  affected  by  the  order  in  which  a  cluster  center  is  defined.  This  approach  requires  a 
supervisory  knowledge  of  the  data  that  rarely  occurs,  especially  when  visualization  of  the 
data  is  difficult  or  impossible. 

In  1991,  Jolion  et  al.  [20]  proposed  an  iterative  clustering  algorithm  based  on  the 
minimum  volume  ellipsoid  (MVE)  robust  estimator.  At  each  iteration  in  this  algorithm, 
the  best  fitting  hyperellipsoidal-shaped  cluster  is  detected  by  comparing  the  distribution 
of  the  pattern  inside  a  minimum  volume  hyper-ellipsoid  to  a  cluster  generated  from  a 
Gaussian  distribution.  The  authors  realized  the  computational  burden  of  finding  the  best¬ 
fitting  hyperellipsoid  in  the  entire  feature  space,  so  they  employed  a  random  subsampling 
technique  and  performed  the  clustering  based  on  these  few  samples.  Their  approach  also 
detects  clusters  sequentially  which  may  result  in  a  nonoptimal  clustering. 

In  1996,  Mao  and  Jain  [25]  proposed  a  neural  network  for  hyperellipsoidal  clustering 
(HEC),  which  uses  the  following  regularized  squared  Mahalanobis  distance: 


=  (1  -  X)DMe{x,mi)  +  XDEix,mi),  (3) 
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where  De  is  the  Euclidean  distance  and  Dmc  is  the  Mahalanobis  distance  when  a  small 
number  e  is  added  to  the  diagonal  elements  of  the  class  covariance  matrix  Q.  For  0  <  A  <  1, 
Drm  is  a  linear  combination  of  the  squared  Mahalanobis  distance  and  the  squared  Eu¬ 
clidean  distance.  Introducing  the  regularized  Mahalanobis  distance  results  in  a  tradeoff 
between  the  previously  mentioned  complementary  properties  associated  with  using  either 
the  Euclidean  distance  or  the  Mahalanobis  distance  by  itself. 

This  thesis  analyzes  the  HEC  algorithm  and  introduces  another  clustering  method 
based  on  the  Mahalanobis  distance  to  each  cluster,  weighted  by  a  factor  that  depends  on 
the  population  of  that  cluster  [35] . 

1.4  Scope 

In  this  thesis,  a  new  clustering  technique  based  on  the  class-based  Mahalanobis  dis¬ 
tance  is  implemented  and  compared  with  the  Mahalanobis  distance  based  HEC  algorithm. 
Both  hard  and  fuzzy  algorithms  using  Euclidean  distance  are  used  as  a  baseline  for  com¬ 
parison.  Problems  that  face  the  use  of  Mahalanobis  distance  in  clustering  are  analyzed 
and  solved.  The  data  are  two-dimensional  Gaussian  and  the  well-known  Iris  data.  Vector 
quantization  coding  of  images  is  also  considered  as  a  practical  application. 

1.5  Approach 

The  approach  taken  in  this  investigation  is  composed  of  three  steps.  The  first  step 
is  to  implement  both  the  LBG  technique,  described  by  Linde  et  al.  [24],  and  the  FCM,  as 
outlined  by  Bezdek  [7].  These  techniques  are  validated  with  similar  test  problems  to  those 
shown  in  the  references.  The  second  step  of  the  approach  is  to  implement  HEC  as  a  neural 
network  technique  for  hyper-ellipsoidal  clustering  using  regularized  Mahalanobis  distance. 
This  technique  will  also  be  tested  with  the  same  test  problems  as  the  LBG  and  the  FCM 
techniques.  Finally,  the  weighted  Mahalanobis  distance  (WMD)  algorithm  is  implemented 
and  compared  with  the  previous  techniques. 


4 


1. 6  Research  Objectives 

The  research  objectives  for  this  thesis  are  as  follows: 

1.  Implement  the  LBG  and  the  FCM  approaches  and  verify  their  performance. 

2.  Build  a  neural  network  algorithm  for  performing  principal  component  analysis  (PCA) 
necessary  for  implementing  the  HEC  algorithm. 

3.  Implement  a  neural  network-based  HEC  algorithm  and  compare  its  performance  with 
the  LBG  and  the  FCM  approaches. 

4.  Build  a  better  Mahalanobis  distance  based  clustering  algorithm  that  is  capable  of 
solving  the  problems  associated  with  using  the  Mahalanobis  distance  in  clustering. 

5.  investigate  the  theoretical  justification  for  the  proposed  method. 

6.  Develop  sufficient  understanding  of  how  each  method  works  to  be  able  to  draw  ap¬ 
propriate  inferences  from  the  results,  beyond  simply  the  output  partitioning. 

1.7  Thesis  Organization 

The  remainder  of  this  thesis  is  organized  as  follows:  Chapter  II  draws  upon  a  detailed 
review  of  pertinent  theory  and  methods  relating  to  clustering  to  develop  four  techniques 
for  clustering,  LBG,  FCM,  HEC,  and  WMD.  Chapter  III  describes  the  implementation 
of  these  techniques.  Chapter  IV  presents  and  discusses  the  results  obtained  by  applying 
these  techniques  to  several  real  and  synthetic  data  sets.  Finally,  Chapter  V  summarizes  the 
results  obtained  in  this  investigation  and  provides  recommendations  for  further  research. 

1.8  Summary 

It  is  difficult  for  some  problems  to  obtain  labeled  data,  such  as  in  the  case  of  speech 
recognition  and  speaker  identification.  Hence,  finding  natural  concentrations  in  data  with¬ 
out  supervision  is  necessary.  Since  real-life  data  are  not  isotropically  distributed  around 
well-separated  centers,  a  system  that  can  detect  hyperellipsoidal  shaped  clusters  is  needed. 
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II.  Theory 


2. 1  Introduction 

This  chapter  focuses  on  the  theoretical  concepts  necessary  to  understand  the  methods 
developed  and  used  for  clustering.  In  the  sections  that  follow,  these  areas  are  treated: 

•  Background  of  statistical  pattern  classification. 

•  Benefits  and  description  of  clustering. 

•  General  A:-means  algorithm  and  its  hard  and  fuzzy  variations. 

•  Principal  component  analysis 

•  Clustering  based  on  the  Mahalanobis  distance  to  allow  for  hyperellipsoidally  shaped 
clusters. 

•  Theoretical  foundation  of  the  weighted  Mahalanobis  distance  clustering  algorithm. 

2. 2  Pattern  Classification 

In  statistical  pattern  classification,  the  goal  is  to  assign  a  d-dimensional  input  pattern 
X  to  one  of  c  classes,  Ui,U2, . . .  ,Uc.  If  the  class  conditional  densities,  p(»|wj),  and  the 
a  priori  class  probabilities,  P{ui),i  =  1,2,  ...,c;  are  completely  known,  the  optimum 
strategy  for  making  the  assignment  is  the  Bayes  decision  rule  [12]:  Assign  a  pattern  x  into 
class  Uj  if 

Piu}j\x)  >  P{wi\x),  i  =  l,2,...,c;  i^j 

where 

is  called  the  a  posteriori  probability  of  class  iVj,  given  the  pattern  vector  x  and  p{x)  is 
the  probability  distribution  function  (PDF)  of  the  measurements.  This  rule  minimizes  the 
overall  probability  of  error. 

Bayes  rule  is  a  special  case  of  a  more  general  decision  rule  that  can  be  formulated 
as  a  set  of  c  discriminant  functions,  gi{x),i  =  1,  2,  ...  ,c.  We  assign  x  to  class  Uj  if 
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gj{x)  >  gi{x),  for  all  i  ^  j.  The  output  nodes  of  feedforward  networks,  such  as  multi¬ 
layer  perceptrons  and  radial  basis  function  networks,  can  be  used  to  compute  discriminant 
functions  and  therefore  can  be  used  as  classifiers  [32]. 

As  implied  before,  the  best  discriminant  functions  are  the  a  posteriori  probabilities. 
However,  in  most  practical  applications  neither  the  a  priori  probability  nor  the  conditional 
PDF’s  for  all  classes  are  known,  and  only  a  finite  amount  of  data  is  available  from  each 
class.  This  means  that  the  class  conditional  density  functions  need  to  be  estimated  using 
just  a  finite  number  of  data  points.  Given  a  collection  of  labeled  samples,  two  approaches 
may  be  used  to  estimate  the  a  posteriori  probability  of  each  class.  The  first  is  to  assume 
values  for  the  a  priori  probabilities,  then  estimate  the  probability  distribution  functions 
at  each  point.  Alternatively,  the  a  posteriori  probabilities  may  be  estimated  directly  from 
the  labeled  samples  [12]. 

One  class  of  procedures  for  estimating  density  functions  entails  assuming  a  func¬ 
tional  form  for  each  PDF,  and  then  estimating  parameters  to  fit  that  form  to  the  data. 
One  of  the  estimation  techniques  is  the  maximum  likelihood  estimation  (MLE)  [12].  In 
practice,  however,  it  may  be  unjustifiable  to  assume  the  correct  form  for  each  PDF.  In 
addition,  many  practical  problems  involve  multimodal  densities,  while  nearly  all  of  the 
classical  parametric  densities  are  unimodal.  The  alternative  to  this  class  of  procedures  is 
to  estimate  the  density  functions  directly  from  the  data,  without  assuming  a  functional 
form.  Because  they  do  not  involve  estimates  of  functional  parameters,  these  procedures 
are  generally  referred  to  as  non-parametric  techniques  [12].  Examples  of  the  well-known 
non-parametric  density  estimation  techniques  are  fc-nearest  neighbor  (fc-NN)  and  Parzen 
window  estimators.  The  estimated  class  conditional  density  functions  can  be  used  to  cal¬ 
culate  the  discriminant  functions.  Neural  networks  also  take  this  approach.  It  was  shown 
that  a  multilayer  perceptron  (MLP)  approximates  the  a  posteriori  probabilities  of  the 
classes  being  trained  [31]. 

In  many  practical  cases  the  data  are  of  limited  sample  size  and  also  unlabeled.  This 
is  because  the  class  memberships  are  not  available  or  because  collection  and  labeling  of 
a  large  set  of  sample  patterns  can  be  very  costly  and  time  consuming.  For  instance, 
recorded  speech  is  virtually  free,  but  labeling  the  speech,  marking  what  word  or  phoneme 
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is  being  uttered  at  each  instant,  is  very  expensive  and  time  consuming  [12].  In  this  case 
classification  can  be  accomplished  by  finding  groups  in  the  data  and  then  labeling  these 
categories  as  specific  classes  by  supervision. 

Clustering  can  also  be  used  to  save  time  by  designing  a  crude  classifier  on  a  small, 
labeled  set  and  then  modifying  the  design  by  allowing  it  to  run  without  supervision  on  a 
large,  unlabeled  set.  In  addition,  clustering  can  give  the  analyst  insight  into  the  nature  or 
structure  of  the  data  especially  when  no  prior  knowledge  is  assumed  or  the  data  cannot 
be  visualized.  Finally,  another  important  application  of  clustering  is  feature  selection.  It 
is  important  to  eliminate  redundant  features  to  reduce  the  dimensionality  of  the  prob¬ 
lem.  Taking  into  account  those  distinct  and  well  separated  features  saves  time  and  makes 
parameter  estimation  more  accurate.  For  all  of  the  above  reasons,  clustering  is  a  very 
important  tool  in  pattern  recognition  and  shall  be  defined  in  the  next  section. 

2.3  The  Clustering  Problem 

Cluster  analysis  is  one  of  the  basic  tools  for  identifying  structure  in  data.  Clustering 
usually  implies  partitioning  of  a  collection  of  objects  (tanks,  stock  market  reports,  images, 
cancerous  areas  in  a  mammogram)  into  c  disjoint  subsets.  That  is,  to  partition  a  set  H 
of  n  samples  X  =  •;Xn}  C  into  subsets  'Hi,..,7Yc.  Each  subset  is  to  represent 

a  cluster,  with  objects  in  the  same  cluster  being  somehow  “more  similar”  than  samples 
in  different  clusters.  In  other  words,  objects  in  a  cluster  should  have  common  properties 
which  distinguish  them  from  the  members  of  the  other  clusters. 

In  fuzzy  clustering,  the  clusters  are  not  subsets  of  the  collection  but  are  instead  fuzzy 
subsets  as  introduced  by  Zadeh  [36] .  That  is,  the  “clusters”  are  functions  assigning  to  each 
object  a  number  between  zero  and  one  which  is  called  the  membership  of  the  object  in  the 
cluster.  Objects  which  are  similar  to  each  other  are  identified  by  the  fact  that  they  have 
high  memberships  in  the  same  cluster.  It  is  also  assumed  that  the  memberships  are  chosen 
so  that  their  sum  for  each  object  is  one.  So,  the  fuzzy  clustering  is,  in  this  sense,  also  a 
partition  of  the  set  of  objects. 
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Each  cluster  may  be  represented  by  a  reproduction  vector,  rrii,  that  is  usually  chosen 
to  be  the  Euclidean  center  of  gravity,  or  the  centroid  of  the  samples  in  that  cluster.  Hence, 
it  is  called  the  cluster  center. 

To  measure  a  clustering  algorithm  performance,  one  defines  a  criterion  function  that 
measures  the  clustering  quality  of  any  partition  of  the  data.  The  goal  is  to  find  the 
partition  that  minimizes  the  criterion  function.  The  approach  most  frequently  used  in 
seeking  optimal  partitions  is  iterative  optimization  [12].  The  basic  idea  is  to  start  with 
an  initial  partition  and  iteratively  change  the  location  of  cluster  centers  and  reassign  data 
points  to  the  different  clusters  in  a  way  that  maximizes  the  criterion  function. 

Clustering  or  unsupervised  learning  algorithms  can  be  grouped  into  two  categories: 
hierarchical  and  partitional.  A  hierarchical  clustering  is  a  nested  sequence  of  partitions, 
whereas  a  partitional  clustering  is  a  single  partition  [18].  Commonly  used  partitional 
clustering  algorithms  are  the  fc-means  algorithm  and  its  variations  (e.g.,  ISODATA  [2]) 
which  incorporates  some  heuristics  for  merging  and  splitting  clusters  and  for  handling 
outliers.  Competitive  neural  networks  are  also  used  to  cluster  input  data.  The  well-known 
examples  are  Kohonen’s  Learning  Vector  Quantization  (LVQ)  and  self-organizing  map  [23], 
and  Adaptive  Resonance  Theory  (ART)  models  [8]. 

A  clustering  algorithm  in  this  context  is  equivalent  to  a  vector  quantizer  (VQ)  where 
the  object  is  to  find  a  system  for  mapping  a  sequence  of  continuous  or  discrete  vectors 
into  a  digital  sequence  suitable  for  communication  or  storage  in  a  digital  channel  [17].  The 
goal  of  vector  quantizers  is  data  compression.  They  are  used  to  reduce  the  bit  rate  while 
maintaining  the  necessary  fidelity  of  the  data  [17].  When  used  in  a  vector  quantization 
context,  cluster  centers  are  termed  codewords  and  the  set  of  all  codewords  is  called  a 
codebook.  To  define  the  performance  of  a  quantizer  in  producing  the  “best”  possible 
reproduction  sequence  requires  the  idea  of  a  distortion  measure.  Despite  the  differences 
between  clustering  algorithms  and  vector  quantizers,  their  ultimate  goal  of  exploiting  the 
similarity  between  the  samples  allows  us  to  use  the  terms  clustering  and  vector  quantization 
interchangeably  throughout  this  thesis. 
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Most  variations  in  the  design  of  a  vector  quantizer  or  a  clustering  algorithm  are  in 
the  choice  of  a  quantitative  measure  of  similarity  and  in  the  definition  of  the  criterion 
function,  or  distortion.  Hence,  these  two  critical  topics  are  the  subject  of  the  following 
subsections. 

2. 3. 1  Similarity  Measures.  The  first  important  question  in  a  clustering  algorithm 
design  is  how  to  determine  “more  similar”.  Similarity  measures  can  be  in  a  form  of  non¬ 
metric  similarity  functions  or  distance  metrics.  Duda  and  Hart  [12]  list  a  number  of 
functions  s(x,  m)  whose  values  are  higher  if  the  vectors  x  and  m  are  somehow  similar.  An 
example  measure  is  the  normalized  inner  product  which  is  the  cosine  of  the  angle  between 
the  two  vectors.  Other  similarity  functions  include  the  two-sided  tangent  distance  and 
Tanimoto  distance  which  is  especially  useful  if  the  features  are  binary  [12]. 

Distance  measures  are  more  often  used  to  measure  similarity.  The  smaller  the  dis¬ 
tance  between  two  patterns,  the  greater  the  similarity.  The  assumption  made  in  using 
these  measures  is  that  similarity  between  objects  is  measured  by  the  distance  between 
their  corresponding  data  vectors.  So,  if  the  distances  to  the  mean  of  a  cluster  are  small, 
the  points  in  the  cluster  are  also  close  to  each  other  and  therefore  assumed  to  be  “similar”. 

Mao  and  Jain  [25]  suggest  that  Euclidean  distance  is  the  most  commonly  used  dis¬ 
tance  metric  in  practice.  However,  choosing  the  Euclidean  distance  to  measure  similarity 
implies  an  isotropic  feature  space  weighting.  Consequently,  a  clustering  algorithm  with 
Euclidean  distance  favors  hyperspherically  shaped  clusters  of  equal  size.  The  clusters  in  a 
data  set  are  not  always  compact  and  isolated  in  the  sense  that  the  between-cluster  vari¬ 
ation  is  much  higher  than  the  within-cluster  variation.  Therefore,  using  the  Euclidean 
distance  has  the  undesirable  effect  of  splitting  large  as  well  as  elongated  clusters  under 
some  circumstances  [25].  It  is  easy  to  find  examples  of  data  sets  in  real  applications  which 
do  not  have  spherically  shaped  clusters  of  equal  size.  For  example.  Fig.  1  shows  the  two- 
dimensional  projection  of  the  well-known  Iris  data  onto  a  plane  spanned  by  the  first  two 
principal  eigenvectors.  Since  we  know  the  category  information  of  the  patterns  in  the  Iris 
data  set,  we  label  each  pattern  by  its  category  in  the  projection  map.  There  are  two  ob¬ 
vious  clusters,  but  neither  of  them  has  a  spherical  shape.  From  Fig.  1,  we  can  see  that 
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Figure  1.  Projection  of  the  Iris  data  set  onto  the  plane  spanned  by  the  first  two  principal 
components 


the  larger  cluster  is  a  mixture  of  two  classes  (Iris  Versicolor  and  Iris  Virginica)  which  have 
nonspherical  distributions. 

An  alternative  distance  metric  that  depends  on  the  distribution  of  the  data  itself  is 
the  Mahalanobis  distance  defined  in  Eqn.  (2)  and  rewritten  here  for  convenience. 


Dm  =  II®  “  ”^110  =  (®  “  m)^C  ^{x  —  m) 


(4) 


where  C  is  the  covariance  matrix  of  a  pattern  population,  m  is  the  mean  vector,  and  x 
represents  a  variable  pattern.  The  Mahalanobis  distance  is  especially  useful  if  the  statistical 
properties  of  the  population  are  to  be  considered. 

Figure  2  shows  a  scenario  of  two-cluster  solutions  by  vector  quantization  using  the 
Euclidean  and  Mahalanobis  distance  measures.  It  is  clear  from  the  figure  that  using  the 
Euclidean  distance  results  in  undesired  grouping  of  the  two  classes.  On  the  other  hand, 
using  the  Mahalanobis  distance  allows  the  natural  grouping  of  the  two  clusters. 

The  equi-Euclidean  distance  surface  defines  a  hyper-sphere  of  unity  aspect  ratio, 
which  is  the  ratio  between  the  major  axis  to  the  minor  axis  in  an  ellipse.  On  the  other 
hand,  the  equi-Mahalanobis  distance  surface  defines  a  hyperellipsoid  of  any  shape  which 
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Figure  2.  Results  of  clustering  the  shown  patterns  into  two  classes  using  the  Euclidean 
and  the  Mahalanobis  distance  metrics 

implies  different  orientations  of  the  major  axis  and  any  aspect  ratio.  This  shows  that  the 
modeling  capabilities  of  the  Mahalanobis  distance-based  clustering  is  much  greater  than 
that  of  the  Euclidean  distance-based  algorithms,  especially  as  the  dimensionality  increases 
wherein  the  probability  of  having  hyperspherically  shaped  cluster  is  virtually  zero. 

It  is  clear  from  Figure  2  that  if  we  could  describe  the  shape  of  each  cluster  by  its 
centroid  and  the  sample  covariance  matrix  of  the  data  points  assigned  to  that  cluster,  then 
the  most  useful  similarity  measure  would  be  the  Mahalanobis  distance. 

2.3.2  Optimality  Criterion.  Choice  of  optimality  criterion  is  a  very  important 
issue  in  the  design  of  a  clustering  algorithm.  Especially  in  higher  dimensions,  one  cannot 
visually  determine  how  good  the  resulting  clusters  are.  One  approach  is  to  check  with  a 
criterion  function.  The  criteria,  or  performance  indices,  are  often  specified  as  a  function 
of  the  memberships  whose  minima  or  maxima  define  “good”  clustering.  The  algorithm 
then  becomes  a  numerical  procedure  for  finding  memberships  which  optimize  the  objective 
function. 

Let  D{xi,  Xi)  be  the  distortion  between  a  pattern  Xi  and  the  closest  cluster  center  Xi. 
Or  equivalently,  this  can  be  thought  of  as  the  cost  of  reproducing  any  input  vector  Xi  as 
a  reproduction  vector  Xi.  The  performance  of  the  system  can  be  quantized  by  an  average 
distortion  E{D{X,X)}  between  the  input  and  the  final  reproduction.  Assuming  that  the 
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process  is  stationary  and  ergodic,  then  the  mathematical  expectation  is,  with  probability 
one,  equal  to  the  long  term  sample  average  [17] 

E{D{X,X)}  ^  (5) 

“  j=o 

Equation  (5)  implies  that  if  the  input  source  statistics  do  not  change  with  time  and  if 
the  time  average  equals  the  ensemble  average  then  the  resulting  sample  average  distortion 
should  be  nearly  the  expected  value,  and  the  same  codebook  used  on  future  data  should 
yield  approximately  the  same  averages. 

As  for  the  choice  of  distortion  measure,  it  ought  to  to  be  tractable,  computable,  and 
subjectively  meaningful  so  that  large  or  small  quantitative  distortion  measures  correlate 
with  bad  and  good  subjective  quality  [17]. 

The  simplest  and  most  widely  used  distortion  measure,  although  lacking  the  above 
properties,  is  the  squared  error  distortion  measure, 

d~l 

D{x,x)  =  \\x-x\\]  =  Y^{xi-Xif.  (6) 

*=o 

Substituting  Eqn.  (6)  in  Eqn.  (5)  yields  the  mean  squared  error  (MSE)  criterion 
function.  Dunn  [13]  developed  the  first  fuzzy  extension  to  the  least  mean  square  error 
approach  to  clustering  and  this  was  generalized  by  Bezdek  [6]  to  a  family  of  algorithms. 
Let  n  be  the  total  number  of  samples,  rij  the  number  of  samples  in  and  rrii  be  the 
mean  of  those  samples.  The  generalized  mean  squared  error  is  then  defined  by: 

Jgmse  —  —  (7) 

ft  .  -  .  , 

1=1  j=i 

where  Uij  is  the  membership  of  the  pattern  to  the  cluster,  m  is  a  weighting  exponent 
strictly  greater  than  one,  and  A  is  any  positive  definite  matrix  [7]. 

Many  algorithms  were  designed  based  on  minimizing  Jgmse-  In  these  algoritms,  the 
Jgmse  is  monotonically  nonincreasing  function  as  the  number  of  iterations  increases. 
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Modifying  the  criterion  function  to  allow  the  use  of  a  generalized  distance  as  a  dis¬ 
tance  metric  is  possible  by  defining  A.  Linde  et  al.  [24]  and  Bezdek  [7]  generalized  A  to 
any  d  x  d  positive  definite  matrix.  They  showed  that  if  the  covariance  matrix  for  all  the 
data  is  used,  the  Mahalanobis  distance  is  the  basis  for  the  partitioning. 

Clustering  using  the  global  covariance  matrix  results  in  a  monotonically  decreasing 
objective  function.  However,  all  the  clusters  may  not  have  the  same  distribution  (shape) 
as  that  of  the  whole  data  set.  Hence,  using  the  global  covariance  matrix  for  each  cluster 
may  not  give  the  best  grouping  of  the  data. 

Other  common  criterion  functions  are  the  average  squared  distances  between  samples 
in  a  cluster  domain  and  the  average  squared  distances  between  samples  in  different  cluster 
domains  [33]. 

Backer  and  Jain  [1]  suggested  a  goal-directed  approach  to  the  cluster  validity  problem. 
The  goal  can  be  minimizing  the  classification  error  rate.  Alternatively,  the  clustering 
performance  can  be  considered  good  if  the  clusters  were  dense,  which  means  minimum 
volume  of  the  equi-Mahalanobis  distance  ellipsoids,  or  the  clusters  were  far  apart.  Gath 
and  Geva  [14]  also  considered  a  large  number  of  the  data  points  in  the  vicinity  of  the 
cluster  center  as  an  indication  of  good  clustering.  In  Section  2.6.4  we  will  see  that  the  goal 
is  maximizing  the  likelihood  function. 

2-4  The  Batch  k-means  Algorithm 

A  well  known  algorithm  for  the  design  of  a  locally  optimal  codebook  with  iterative 
codebook  improvement  is  the  generalized  Lloyd  algorithm  (GLA)  [15].  The  two  steps  in 
each  iteration  of  this  algorithm  are: 

Step  1:  Given  a  codebook  M  =  {rrii  ;  i  =  1,. . .  ,A:  }  obtained  from  the  iteration,  assign 
each  data  point  to  the  closest  codeword. 

Step  2:  Obtain  the  codebook  M(+i  by  computing  the  centroid  of  each  cluster  based  on 
the  partitioning  of  Step  1. 
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where  the  closest  codeword  is  typically  found  with  a  distance  metric  based  on  the  as¬ 
sumption  that  similar  feature  vectors  are  likely  to  lie  close  to  each  other  in  the  feature 
space.  The  above  algorithm  is  usually  terminated  when  the  codewords  stop  moving  or  the 
difference  between  their  locations  in  consecutive  iterations  is  below  a  threshold. 

The  GLA  algorithm  is  widely  known  as  the  batch  fc-means  algorithm  when  the  dis¬ 
tortion  measure  used  is  the  sum  of  squared  distances  from  all  points  in  a  cluster  domain 
to  the  cluster  center  [33].  A  sequential  fc-means,  where  the  update  to  the  cluster  centers 
occurs  after  each  pattern  being  classified,  is  proposed  by  McQueen  [26]. 

In  the  next  sections,  two  methods  that  use  variations  on  the  described  fc-means 
algorithm  are  presented.  We  will  discuss  how  these  methods  differ  in  calculating  the 
membership  function  of  the  data  to  the  reproduction  vectors. 

2.4-1  The  LBG  Algorithm.  The  first  algorithm  is  described  by  Linde,  Buzo,  and 
Gray  in  1980  [24],  and  is  also  known  as  the  LBG  algorithm  for  short.  The  LBG  Algorithm 
is  similar  to  fc-means,  but  the  distortion  measure  is  general.  A  survey  of  different  distortion 
measures  is  provided  in  the  paper.  It  was  shown  that  the  best  partition  is  obtained  by 
choosing  the  nearest-neighbor  codeword  to  each  input,  and  no  reproduction  alphabet  can 
yield  a  smaller  average  distortion  than  that  containing  the  centroids  of  the  subsets.  The 
GLA  iterative  method  with  these  rules  is  guaranteed  to  result  in  a  non-increasing  average 
distortion  (based  on  the  distance  from  the  cluster  centers  to  the  data  points  within  the 
clusters)  after  each  iteration,  and  it  terminates  in  a  local  minimum  when  the  average 
distortion  stops  changing  significantly. 

2.4.2  The  Fuzzy  c-means  Algorithm.  In  a  “hard”  clustering  algorithm  such 
as  the  fc-means  algorithm,  each  pattern  is  to  be  assigned  to  only  one  cluster.  However, 
this  “All-or-None”  membership  constraint  is  not  realistic,  since  many  pattern  vectors  may 
have  characteristics  of  more  than  one  class.  Hence  came  the  idea  of  assigning  a  set  of 
membership  function  values  for  each  pattern  to  every  one  of  the  different  classes.  This 
results  in  a  fuzzy  boundary  rather  than  a  hard  one. 
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The  first  and  most  widely  used  fuzzy  clustering  algorithm  is  the  fuzzy  c-means  al¬ 
gorithm  [34].  It  is  an  iterative  procedure  for  finding  a  membership  matrix,  U,  and  a  set 
of  cluster  centers,  M,  to  minimize  the  generalized  mean  squared  error  objective  function, 
see  Eqn.  (7).  By  minimizing  Jqmse,  one  should  obtain  high  memberships  in  cluster  j  for 
data  which  are  close  to  the  vector  rrij. 

The  vectors  mi,  m2, . . . ,  rric  can  be  interpreted  as  prototypes  for  the  clusters  rep¬ 
resented  by  the  membership  functions,  and  are  also  called  cluster  centers.  In  order  to 
minimize  the  objective  function,  the  cluster  centers  and  membership  functions  are  chosen 
so  that  high  memberships  occur  for  points  close  to  the  corresponding  cluster  centers.  The 
number  m  is  called  the  exponent  weight.  It  can  be  chosen  to  “tune  out”  noise  in  the  data. 
The  higher  the  value  of  m,  the  less  the  contribution  of  those  data  points  whose  member¬ 
ships  are  uniformly  low  to  the  objective  function.  Consequently,  such  points  tend  to  be 
ignored  in  determining  the  centers  and  membership  functions  [34].  The  actual  construc¬ 
tion  of  the  FCM  algorithm  is  based  on  the  following  set  of  equations  which  are  necessary 
conditions  for  mi,  m2, . . . ,  me  and  the  membership  matrix  U  to  produce  a  local  minimum: 


rrii  = 


(8) 


and 

'Uij  2  '  (^) 

■^c  /  {Xj-mi)'^A-^{Xj-mi)  \  ’"-1 

y{Xj-mkVA-^Xj-mk)) 

The  update  equations  are  obtained  by  differentiating  the  objective  function  with 
respect  to  the  components  of  V  and  U  subject  to  the  constraint  that  ]Ei=i  '^ij  =  1-  No 
closed  form  solution  has  been  found  for  this  differentiation,  but  these  equations  are  the 
basis  for  an  iterative  procedure  which  converges  to  a  local  minimum  for  the  objective 
function  [34].  One  could  start  with  an  initial  guess  of  the  value  for  the  membership  values, 
U,  for  a  specific  choice  of  m  and  c.  Using  these  memberships  and  Eqn.  (8),  one  computes 
cluster  centers,  then  using  these  cluster  centers  and  Eqn.  (9)  recomputes  memberships  and 
so  forth,  iterating  back  and  forth  between  Eqn.  (8)  and  Eqn.  (9)  until  the  memberships  or 
cluster  centers  for  successive  iterations  differ  by  no  more  than  some  prescribed  value. 
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In  the  cases  where  classification  is  the  goal,  Eqn.  (9)  can  be  used  as  the  basis  where 
maximum  membership  function  value  Wy  means  smaller  distance  and  higher  likelihood 
that  the  data  point  Xi  is  a  member  of  the  j*'*  cluster.  The  class  membership  function 
values  provide  a  measure  of  similarity  by  which  a  soft  decision  can  be  made  with  a  degree 
of  confidence. 

2.5  Principal  Component  Analysis 

One  of  the  topics  that  is  used  in  signal  processing  and  used  in  the  techniques  to  be 
presented  in  the  next  sections  is  the  principal  components  analysis  (PCA).  The  goal  is  to 
find  the  maximum  variance  in  the  data  and  to  transform  the  data  into  a  space  where  the 
features  are  uncorrelated. 

Let  Cx  be  the  covariance  matrix  of  the  data  set  X.  Since  is  real  and  symmetric,  it 
is  possible  to  find  a  set  of  orthonormal  eigenvectors  [16].  Let  and  ipi,  i  =  1,2,. . .  ,d,  be 
the  eigenvectors  and  the  eigenvalues  of  C^.  If  we  arrange  the  eigenvalues  in  decreasing  order 
so  that,  >  ■ . .  >  'ipdi  a  transformation  matrix  whose  rows  are  the  corresponding 

eigenvectors  of  Cx  is  defined  to  be  $. 

If  rrix  is  the  mean  of  the  samples  X,  the  transformation 

y  =  $(»  -  mx)  (10) 

is  called  principal  component,  Karhunen-Loeve,  or  Hotelling  transformation. 

It  can  be  shown  that,  the  mean  of  Y  is  zero  and  its  covariance  matrix  is  given  by  [16]: 

Cy  =  =  ^  =  (11) 

which  means  that  the  features  of  Y  are  uncorrelated.  Furthermore,  if  we  divide  each 
eigenvector  by  the  square  root  of  the  corresponding  eigenvalue,  the  resultant  covariance 
matrix,  Cy,  will  be  the  identity.  Hence,  this  scaled  principal  component  analysis  is  called 
the  whitening  transform. 
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Another  interesting  property  of  this  transform  can  be  seen  if  we  let 


y  =  (12) 

and 

V  =  (13) 

for  any  vector  x  e  X.  Computing  the  Euclidean  distance  between  v  and  y  yields 

DEiy,v)  =  {y-vf{y-v).  (14) 

Substituting  Eqn.  (12)  and  Eqn.  (13)  and  using  the  fact  that  for  any  orthonormal 
matrix 

(15) 

it  can  be  shown  that  Eqn.  (14)  becomes 

DE{y,v)  =  {x -m^fC-\x -m^),  (16) 

which  is  the  definition  of  the  Mahalanobis  distance  between  a  vector  x  and  a  cluster  center 
rrix-  In  other  words,  this  says  that  the  Mahalanobis  distance  between  an  input  pattern  x 
and  a  cluster  center  nix  th®  input  space  is  equivalent  to  the  Euclidean  distance  between 
them  in  the  transformed  space,  i.e., 

Dm{x,  nix)  =  Dsiy,  v).  (17) 

For  any  q  <  d,  the  projection  of  x  onto  the  span  of  {ei,  62, . . . ,  e,}  accounts  for 
the  maximum  fraction  of  the  total  sample  variance  in  x  that  can  be  accounted  for  by  a 
linear  projection  onto  a  5-dimensional  vector  subspace.  Plotting  the  first  few  principal 
components  allows  us  to  visually  examine  the  data  that  account  for  most  of  its  variance. 
These  properties  of  the  principal  component  analysis  will  be  utilized  in  the  Mahalanobis 
distance  based  algorithms. 
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Figure  3.  The  case  where  a  large  cluster  engulfs  a  neighboring  smaller  cluster,  (a)  The 
first  cluster  has  more  samples  than  the  second  class,  (b)  More  samples  are 
assigned  to  the  cluster  of  the  left  as  the  iterative  process  continues 


2.6  Clustering  Based  on  The  Mahalanobis  Distance 

We  have  seen  that  the  LBG  and  the  FCM  algorithms  allowed  A  to  be  any  positive 
definite  matrix.  Any  selection  of  A  implies  that  the  algorithm  is  looking  for  clusters  that 
represent  concentration  of  the  data  about  the  centers,  M,  whose  shape  are  determined  by 
the  matrix  A.  As  stated  before,  if  A  =  7,  the  clusters  identified  tend  to  be  spherical.  In  [7] 
and  [24],  the  authors  suggested  that  A  can  be  the  covariance  matrix  of  all  the  samples. 
This  is  again  considered  imposing  the  shape  of  the  overall  distribution  of  the  data  set  on 
each  cluster,  which  might  not  be  the  case  in  practical  situations.  Assuming  a  different 
matrix  for  each  cluster  has  the  effect  of  detecting  rather  than  imposing  shapes  in  the  data. 

In  the  next  sections,  we  will  first  describe  some  of  the  problems  that  faces  the  use 
of  Mahalanobis  distance  in  clustering.  Next,  we  discuss  two  methods  that  use  the  Ma¬ 
halanobis  distance  from  each  cluster  center  as  a  similarity  measure.  We  will  discuss  how 
these  methods  try  to  solve  the  described  problems. 


2.6.1  Problems  with  Integrating  The  Mahalanobis  Distance  in  Clustering.  There 
are  some  difficulties  associated  with  computing  the  cla.ss-based  Mahalanobis  distance  as  a 
similarity  measure  in  a  fc-means-type  algorithm: 

1.  The  covariance  matrices  needed  for  the  Mahalanobis  distance  calculation  in  the  first 
iteration  cannot  be  determined  a  priori. 
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2.  If  the  number  of  patterns  in  a  cluster  is  small  compared  to  the  input  dimensionality 
d,  then  the  d  x  d  sample  covariance  matrix  of  the  cluster  may  be  singular. 

3.  The  fc-means  clustering  algorithm  with  the  Mahalanobis  distance  tends  to  generate 
unusually  large  or  unusually  small  clusters  [35],  [25].  Figure  3  shows  the  case  of 
large  cluster  engulfing  the  neighboring  cluster.  Points  on  the  perimeter  are  of  equal 
Mahalanobis  distance  to  the  cluster  centers.  Since  the  first  cluster  exhibits  more 
variations  in  both  directions,  its  covariance  matrix  is  larger.  Hence  the  Mahalanobis 
distance  to  the  cluster  center  with  a  larger  A  is  more  likely  to  be  smaller  and  more 
samples  are  assigned  to  that  cluster  in  subsequent  iterations. 

4.  When  performing  the  clustering  based  on  the  individual  covariance  matrices  to  allow 
for  each  cluster  to  have  its  distinct  shape,  Jgmse  does  not  result  in  a  monotonically 
decreasing  function  as  the  number  of  iterations  increases. 

Figure  4  shows  that  as  the  data  cluster  together  in  smaller  groups,  the  variance  of 
data  within  a  cluster  is  now  less  which  will  be  reflected  in  the  covariance  matrix.  The 
distance  between  a  point  x  and  the  mean  m  is  normalized  by  the  smaller  variance. 
So  while  the  Euclidean  distance  is  still  the  same,  the  Mahalanobis  distance  got  larger 
in  the  second  iteration.  The  change  in  the  Mahalanobis  distance  is  indicated  by  the 
equi-Mahalanobis  distance  ellipses.  Therefore,  the  sum  of  the  Mahalanobis  distances 
increases. 

2.6.2  The  Hyperellipsoidal  Clustering  (HEC)  Neural  Network.  As  described  in 
the  previous  sections,  the  LBG  or  the  FCM  algorithm  performs  clustering  using  either  the 
Euclidean  distance  or  using  the  Mahalanobis  distance  beised  on  the  global  covariance  ma¬ 
trix.  Mao  and  Jain  [25]  proposed  a  neural  network  which  implements  partitional  clustering 
using  the  following  regularized  Mahalanobis  distance  between  the  data  point  Xi  and  the 
cluster  center  m*: 


DRM{xi,mk)  =  {xi  -  mk)'^[{l  -  X){Ck  +  el)  ^  +  XI]{xi  -  rrik)  (18) 
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(a) 


(b) 


Figure  4. 


The  results  of  clustering  using  the  Mahalanobis  distance  in  subsequent  itera¬ 
tions.  Jqmse  is  smaller  in  the  first  iteration  (a)  than  in  the  next  iteration  (b). 


where  A  and  e  are  called  the  regularization  parameters.  Note  that  this  regularized  Ma¬ 
halanobis  distance  takes  the  shape  of  each  cluster  into  consideration.  When  A=0  and 
e=0,  Drm  becomes  the  squared  Mahalanobis  distance.  When  A=  1,  Drm  reduces  to  the 
squared  Euclidean  distance.  When  0  <  A  <  is  a  linear  combination  of  the  squared 

Mahalanobis  distance  and  the  squared  Euclidean  distance.  Therefore,  A  can  be  used  as  a 
parameter  to  control  the  degree  that  the  distance  measure  deviates  from  the  commonly 
used  squared  Euclidean  distance.  On  the  other  hand,e  is  added  to  prevent  the  singularity 
of  the  covariance  matrix  C*  as  shall  be  discussed  later. 

Equation  (17)  indicates  that  the  computation  of  the  Mahalanobis  distance  can  be 
decomposed  into  two  steps;  first,  projecting  the  input  pattern  into  the  whitened  space  and 
second,  computing  the  Euclidean  distance  in  the  whitened  space. 

Neural  networks  provide  a  suitable  architecture  for  implementation  of  many  signal 
processing  techniques,  hence,  the  authors  implemented  the  two-step  computation  using  a 
two-layer  network  whose  first  and  second  layers  compute  the  whitening  transform  and  the 
Euclidean  distance,  respectively.  Then,  the  regularized  distance  can  be  computed  by  the 
weighted  sum  of  two  Euclidean  distances:  one  in  the  original  input  space  and  one  in  the 
whitened  space. 
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Figure  5.  Principal  component  analysis  neural  network 

Since  it  is  necessary  to  find  the  eigenvectors  and  eigenvalues  before  such  a  regularized 
Mahalanobis  distance  can  be  calculated,  a  neural  network  version  of  principal  component 
analysis  algorithm  was  chosen  to  perform  the  whitening.  In  the  next  sections,  a  description 
of  the  PCA  neural  network  algorithm  is  outlined.  Then,  the  architecture  for  the  HEC 
network  is  described.  Finally,  a  discussion  of  the  HEC  algorithm  is  provided. 

2.6.2. 1  Principal  Component  Analysis  Subnetworks.  Rubner  and  Tavan  [30] 
presented  a  two-layered  network  whose  weights  converge  to  the  eigenvectors  of  the  covari¬ 
ance  matrix  of  the  input  patterns,  see  Fig.  5. 

All  the  output  nodes  are  hierarchically  organized  such  that  the  output  node  i  is 
connected  to  the  output  node  j  with  connection  strength  if  and  only  if  i  <  j. 

For  an  input  pattern  x,  the  PCA  net  outputs  hj 

hj  =Wj-{x~  m^)  +  fijhi.  (19) 

i<} 
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The  weights  on  the  connections  between  the  input  nodes  and  the  output  nodes  of 
the  PCA  subnetwork  are  adjusted  according  to  the  Hebbian  rule,  i.e., 

Awij  ='f{xi-mi)hj,  i,j  =  1,2,...  ,d-,  (20) 

where  7  is  the  learning  rate.  The  lateral  weights  are  updated  according  to  the  anti-Hebbian 
rule 

Afij  =  —fihihj,  l,j  =  l,2,...,d,  I  <.  j',  (21) 

where  pL  is  another  positive  learning  rate.  Rubner  and  Tavan  demonstrated  that  following 
the  above  update  equations  forces  the  lateral  weights  to  vanish  and  the  activities  of  the 
output  cells  to  become  uncorrelated.  Then  the  weights  on  the  connections  to  the 
output  layer,  Wj,  are  the  eigenvector  corresponding  to  the  largest  eigenvalue.  The 
eigenvalue  is  estimated  by  the  variance  of  hj  after  the  network  converges. 

2. 6. 2.2  Architecture  for  Hyperellipsoidal  Clustering  Network.  Figure  6 
shows  the  diagram  of  the  network  architecture  of  HEC.  The  input  layer  has  d  input  nodes 
corresponding  to  d  features.  The  hidden  layer  is  composed  of  k  x  d  linear  nodes  which 
are  grouped  in  k  subnetworks  with  d  nodes  each,  where  k  is  the  number  of  clusters.  Each 
network  is  designed  to  perform  the  whitening  transform  of  a  cluster.  The  output  layer 
has  k  nodes  corresponding  to  k  clusters,  and  is  basically  a  “winner-take-all”  network.  The 
lateral  connections  have  not  been  shown.  Competitive  learning  is  performed  in  the  output 
layer.  The  input  layer  is  fully  connected  to  the  output  layer.  All  the  output  nodes  in  the 
subnetwork,  however,  are  connected  only  to  the  fcj/,  node  in  the  output  layer. 

Since  each  cluster  needs  its  own  whitening  transform,  the  weights  in  the  PCA  sub¬ 
networks  are  updated  only  when  these  patterns  in  the  corresponding  cluster  are  presented. 
Moreover,  in  the  PCA  learning  mode,  the  centroid  of  the  corresponding  cluster  should  be 
subtracted  from  the  input  pattern  vectors.  Therefore,  before  the  PCA  learning  is  invoked, 
the  cluster  centroid  must  be  learned. 
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Figure  6.  Architecture  of  the  neural  network  for  Hyper  ellipsoidal  clustering  (HEC) 

Each  node  in  the  output  layer  computes  the  weighted  squared  Euclidean  distance 
between  its  inputs  (both  the  input  pattern  and  outputs  of  the  connected  subnetworks)  and 
the  stored  pattern  on  the  connections. 

2. 6. 2. 3  The  HEC  Algorithm.  The  algorithm  starts  with  initial  cluster 
centers,  eigenvectors,  and  eigenvalues  and  then  iteratively  measures  the  regularized  Ma- 
halanobis  distance  to  each  cluster  center.  The  input  pattern  then  updates  the  PCA  sub¬ 
network  for  the  class  that  yields  the  minimum  Drm-  The  algorithm  updates  the  cluster 
centroids  using  a  batch  vector  quantization  algorithm.  Afterwards,  a  new  partition  of  the 
data  set  into  k  clusters  is  formed.  Then,  the  HEC  algorithm  reinvokes  the  PCA  learning 
to  update  the  individual  whitening  transforms  (PCA  subnetworks)  for  all  the  k  clusters. 
If  the  stopping  criterion  is  not  met,  the  value  of  the  regularization  parameter  A  is  reduced. 
The  algorithm  is  terminated  either  when  the  cluster  membership  of  input  patterns  does 
not  change  or  the  maximum  number  of  learning  cycles  is  reached. 

The  HEC  algorithm  tries  to  solve  the  problems  associated  with  using  the  Mahalanobis 
distance  in  clustering,  (see  Section  2.6.1).  It  starts  with  preference  for  the  Euclidean 
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distance  to  address  the  first  problem  concerning  initial  values  for  the  individual  covariance 
matrices.  After  the  data  have  been  partitioned  using  a  batch  vector  quantization  algorithm 
with  Euclidean  distance,  the  means  and  the  covariance  matrices  of  the  individual  clusters 
are  calculated  and  used  for  upcoming  iterations. 

The  second  problem  of  dealing  with  ill-posed  conditions  is  solved  by  the  introduction 
of  the  regularization  parameter  e.  The  role  of  e  is  to  convert  a  singular  covariance  matrix 
to  a  nonsingular  matrix  by  adding  a  diagonal  matrix  with  small  elements.  This  actually 
forces  the  cluster  to  have  a  non-zero  extent  in  all  dimensions. 

The  authors  suggested  the  use  of  larger  value  of  A  when  the  covariance  matrix  C 
cannot  be  reliably  estimated  or  learned.  In  addition,  this  regularized  MD  was  hoped  to 
solve  the  problem  of  large  clusters  swallowing  smaller  nearby  ones.  However,  the  algorithm 
starts  with  initial  value  of  A  =  1  and  decreases  with  each  iteration.  This  suggests  that  this 
problem  is  more  likely  to  occur  very  often  only  during  the  first  few  cycles  of  learning. 

In  this  algorithm  the  performance  was  not  evaluated  by  minimization  of  the  cost 
function  defined  in  the  paper  as  the  Jqmse  with  m  =  1.  However,  the  quality  of  the 
resultant  clustering  is  assumed  to  correlate  with  the  compactness  of  the  output  clusters. 
The  compactness  of  a  cluster  is  measured  by  the  significance  level  of  the  Kolmogorov- 
Smirnov  test  on  the  distribution  of  the  Mahalanobis  distances  of  patterns  in  a  cluster  to 
the  cluster  center  under  the  Gaussian  cluster  assumption.  Moreover,  it  was  not  shown  how 
this  is  incorporated  in  the  algorithm 

2.6.3  The  Weighted  Mahalanobis  Distance  Clustering  Algorithm.  The  idea  be¬ 
hind  the  proposed  method  is  to  allow  each  cluster  to  have  its  own  shape  implied  by  the 
covariance  matrix  of  the  samples  within  that  cluster.  A  cluster  would  attract  those  data 
points  which  enhance  its  shape  and  reject  those  which  do  not  fit.  In  other  words,  each  vec¬ 
tor  X  will  be  assigned  to  the  cluster  to  which  it  best  integrates  regardless  of  the  Euclidean 
separation  between  that  data  point  and  that  cluster’s  centroid. 

In  this  algorithm,  we  modify  the  GLA  described  in  Section  2.4  such  that  in  each  iter¬ 
ation  we  assign  a  pattern  x  to  the  cluster  that  yields  the  minimum  weighted  Mahalanobis 
distance,  Di  =  Wi  *  \  \x  —  where  Wi  is  the  cluster  weight  and  its  computations  will 


25 


be  discussed  later.  In  the  second  step,  we  update  rUj  and  Q  by  computing  the  mean  and 
the  covariance  matrix  of  the  data  points  of  each  cluster  based  on  the  partitioning  of  the 
first  step.  The  algorithm  is  terminated  when  the  codewords  stop  moving. 

Two  ways  to  get  an  initial  codebook  are  discussed  here.  The  first  method  treats  the 
whole  data  set  as  a  unique  cluster  and  grows  the  codebook  to  the  desired  number  of  clusters, 
k,  by  splitting.  In  splitting,  small  perturbation  vectors  p  are  added  to  every  codeword. 
We  made  for  the  cluster  to  be  in  the  direction  of  the  eigenvector  corresponding  to 
the  maximum  eigenvalue  of  Aj.  This  provides  a  systematic  approach  of  distributing  the 
codewords  since  each  cluster  center  splits  in  a  particular  direction  based  on  the  cluster’s 
shape.  The  second  approach  to  get  an  initial  codebook  is  used  when  the  desired  codebook 
size  is  not  an  integer  power  of  two.  This  method  uses  the  Karhunen-Loeve  transformation 
to  place  the  initial  codewords  along  the  principal  component  axes  of  the  data’s  covariance 
matrix.  For  the  first  iteration,  we  use  the  global  covariance  matrix  as  Ai  for  all  the 
codewords. 

If  the  number  of  data  points  in  any  cluster  is  less  than  the  dimensionality  of  the  data, 
then  Ai  might  be  singular  which  prevents  computing  the  inverse.  Thus,  we  add  a  matrix 
with  small  diagonal  elements  to  the  covariance  matrix  to  prevent  the  singularity. 

The  introduction  of  the  weight  w  is  due  to  the  fact  that  the  use  of  Mahalanobis 
distance  alone  in  clustering  sometimes  causes  a  large  cluster  to  attract  members  of  neigh¬ 
boring  clusters.  This  leads  to  unusually  large  and  unusually  small  clusters.  We  evaluate  Wi 
as  the  ratio  of  the  number  of  samples  in  Hi  to  the  size  of  the  training  set.  This  approach  is 
appealing  as  it  does  not  rely  on  an  arbitrary  constant  or  user  adjustable  parameter.  Even 
though  this  weight  penalizes  large  clusters,  it  does  not  prevent  a  cluster  from  having  more 
samples  if  they  are  much  closer  to  it  than  other  codewords.  This  is  similar  to  conscience 
in  neural  networks  which  rewards  a  balanced  representation  of  the  data  points  [10],  [29]. 

In  trying  to  deal  with  the  problem  of  nondecreasing  Jgmsej  one  can  suggest  the  use 
of  another  goal-oriented  criterion  function  such  as  the  sum  of  the  volume  of  the  ellipsoids. 
Nevertheless,  we  recall  that  the  update  equations  for  the  LEG  and  the  FCM  algorithms 
were  computed  based  on  minimizing  Jgmse  when  A  was  fixed  for  all  iterations  and  clusters. 
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However,  as  the  data  cluster  together  in  smaller  groups,  the  variances  decrease  and  the 
Mahalanobis  distance  increases  as  was  explained  before.  Looking  at  the  Jgmsb  criterion 
function  again  and  allowing  Ai  to  be  variable 

Jgmse  =  -  13 (22) 

i=l  j=l 

it  is  clear  that  there  is  a  need  to  restrict  Ai  somehow  in  order  to  obtain  a  nontrivial  solution. 
Otherwise,  the  minimum  of  Jgmse  would  be  given  by  A~^—0,  which  corresponds  to  a  huge 
cluster  with  infinite  variation  in  all  directions. 

One  way  to  solve  this  problem  is  to  force  the  determinant  of  all  clusters  to  have  a 
unity  value.  In  d-dimensional  space,  a  surface  of  constant  Mahalanobis  distance  Dm  is  a 
hyperellipsoid.  The  volume  of  this  d-dimensional  hyperellipsoid  is  [12:24] 

y  =  Va\C\^/^Di„  (23) 

where  Va  is  the  volume  of  an  d-dimensional  unit  radius  hypersphere,  which  only  depends 
on  the  dimensionality  of  the  space: 


(1  6V6n 

(rf/2)!  ’ 


(24) 


nd_id-l-)/2(  d-l\, 

Vd  =  — - dodd. 


(25) 


Hence,  by  restricting  jCj  =  1,  constant  Mahalanobis  distance  surfaces  now  enclose  constant 
volumes. 


This  choice  of  this  constraint  has  been  criticized  because  it  seemed  somewhat  ar¬ 
bitrary  [34].  However,  it  has  the  effect  of  normalizing  the  volume  enclosed  by  the  equi- 
Mahalanobis  distance  hyperellipsoid  to  a  constant  volume  for  all  clusters  while  maintaining 
the  distinct  shape  of  each  cluster. 

After  the  normalization  of  the  determinants,  the  Jgmse  distortion  has  become  mono- 
tonically  decreasing  as  a  function  of  clustering  iterations.  This  means  that  now  we  can 
use  the  criterion  function  in  Mahalanobis  distance  based  algorithms  to  find  the  number  of 
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clusters  k  or  to  define  a  stopping  criterion.  In  the  next  section,  we  will  give  an  analysis  of 
the  weighted  Mahalanobis  distance  algorithm  as  an  approximation  to  the  famous  expec¬ 
tation  maximization  method  as  applied  to  the  problem  of  maximum  likelihood  estimation 
in  the  Gaussian  mixture  model. 

2.6.4  Relation  to  Maximum  Likelihood  Estimate  in  The  Gaussian  Mixture  Model 
( GMM).  The  use  of  different  covariance  matrices  for  different  clusters  was  a  motivation 
to  investigate  the  similarity  between  the  WMD  algorithm  and  the  Gaussian  mixture  model. 
This  presentation  of  GMM  will  follow  the  outline  of  Reynolds  [28] ,  to  which  the  reader  is 
referred  for  a  more  extensive  treatment  of  the  subject. 

A  Gaussian  mixture  density  is  a  weighted  sum  of  L  component  densities  as  given  by 
the  equation: 

L 

p(a:|0)  =  Y^pibi{x),  (26) 

i=l 

where  a;  is  a  d-dimensional  random  vector,  hi{x)  are  the  component  densities  and  pi  are  the 
mixture  weights  for  i  =  1, . . . ,  L.  Each  component  density  is  a  d- variate  Gaussian  function 
of  the  form 


bi{x)  = 


exp 


--(ait  -  rriifCi  \xt  -  m*) 


(27r)<^/2|Q|i/2 

where  rrii  represents  the  mean  vector  and  Q  is  the  covariance  matrix  [12:23]. 


(27) 


The  mixture  weights  satisfy  the  constraint  that  Pi  =  1,  which  ensures  the 
mixture  is  a  true  probability  density  function.  The  complete  Gaussian  mixture  density 
is  parameterized  by  the  mean  vectors,  covariance  matrices  and  mixture  weights  from  all 
component  densities  which  is  represented  by  the  notation 


0  (Pi)  Gj),  i  —  1, . . . ,  L. 


(28) 


It  is  known  that  sometimes  the  observation  density  is  not  well  represented  by  any 
one  parametric  PDF,  such  as  a  single  Gaussian  or  Laplacian  density.  Hence,  the  GMM  can 
be  viewed  as  a  parametric  PDF  based  on  a  linear  combination  of  Gaussian  basis  functions 
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capable  of  representing  a  large  class  of  arbitrary  densities.  One  of  the  powerful  attributes 
of  the  GMM  is  its  ability  to  form  smooth  approximations  to  arbitrarily  shaped  densities. 
Whereas  the  classical  uni-modal  Gaussian  model  could  model  a  feature  distribution  only  by 
a  position  (mean  vector)  and  an  elliptic  shape  (covariance  matrix),  the  GMM  has  L  mean 
vectors,  covariance  matrices  and  weight  parameters  which  allow  it  much  greater  modeling 
capabilities. 

Maximum  likelihood  parameter  estimation  is  a  general  and  powerful  method  for 
estimating  the  parameters  of  a  stochastic  process  from  a  set  of  observed  samples.  It  is 
based  on  finding  the  model  parameters,  ©,  which  are  the  most  likely  to  have  produced  the 
set  of  observed  samples.  This  is  accomplished  by  finding  the  parameters  which  maximize 
the  model’s  likelihood  function  over  a  set  of  observations.  For  a  set  of  observations  referred 
to  collectively  as  X  =  . . .  ,Xn),  the  likelihood  function  for  a  model  ©  is  defined 

as  the  joint  probability  density  of  X  treated  as  a  function  of  ©.  Thus,  the  probability 
density  p(X|©)  is  called  a  likelihood  function  when  it  is  considered  as  a  function  of  ©. 
This  likelihood  function  can  be  calculated  as: 


p(X|©)  =  53p(W,/|©),  (29) 

/ 

where 

n 

p{X,I\e)  =^]\pitbit{xt),  (30) 

t=l 

is  the  joint  PDF  of  X  and  I,  and  the  notation  X)/  denotes  the  sum  over  all  possible 
assignments  of  the  data  set. 


A  necessary  condition  for  the  maximum  likelihood  parameter  estimates  is  that  they 
satisfy  the  likelihood  equation 


dp{x\e) 

dQ 


=  0. 


(31) 


Unfortunately,  attempting  to  solve  Eqn.  (31)  directly  for  the  GMM  parameters  does  not 
yield  a  closed  form  solution  [28].  However,  an  iterative  parameter  estimation  procedure 
which  is  a  special  case  of  the  Expectation  Maximization  (EM)  algorithm  can  be  imple¬ 
mented  to  find  the  Maximum  likelihood  estimates  [3].  The  EM  algorithm  is  the  basis 
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for  several  parameter  estimation  procedures  in  the  area  of  statistical  data  analysis.  The 
widespread  use  of  the  EM  algorithm  stems  from  the  fact  that  it  guarantees  a  non-decreasing 
likelihood  function  after  each  iteration.  The  basic  idea  of  the  EM  algorithm  is,  beginning 
with  an  initial  model,  ©,  estimate  a  new  model,  ©,  such  that  p(X|©)  >  p(X|©).  The  new 
model  then  becomes  the  current  model  and  the  process  is  repeated  until  some  convergence 
threshold  is  reached. 

Reynolds  [28]  stated  that  Baum  [3]  has  derived  an  auxiliary  function. 


Q(©,0)  =  52p(X,7|©)logp(X,7|©),  (32) 

I 

which  exhibits  the  property  that  finding  ©  that  maximizes  Q(©,  0),  results  in  an  increase 
in  the  observed  data  likelihood,  which  is  the  goal.  Reynolds  [28]  showed  that  the  estimation 
equations  for  the  model’s  parameters  are  as  follows: 

The  mixture  weights  are  obtained  by: 


Pi  =  ^'f^P{Wi\Xu@). 


The  component  density  means  are  estimated  as: 

*  Er=i^KK©)' 


(33) 


(34) 


The  component  density  covariance  matrices  are  computed  as  follows: 

^  Er=i  Piwi\xt,  ©)(x  -  mi){x  -  rui)'^ 

Er=i ®) 

Where  in  all  of  the  previous  equations 


(35) 


p{wi\xt,e) 


Pibijx) 

T,l=iPkh{x)' 


(36) 


Using  the  above  equations,  the  EM  algorithm  now  consists  of  the  following  steps: 
Initialize  Start  with  initial  model  parameters,  0° 
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E-Step  Estimate  new  model  parameters  0  using  the  estimation  equations 
M-Step  Replace  current  model  parameters  with  the  new  model  parameters 


Iterate  Iterate  between  the  E  and  the  M  steps  until  the  likelihood  function  stops  increas¬ 
ing 

Looking  closely  at  Eqn.  (36),  which  can  be  rewritten  as 


P{wi\xt,e) 


\Ci\  exp  [-|(a;f  -  ^{xt  -  mj)]  pi 

ELi  exp  -  TTikYC^^ixt  -  rrik)]  Pk  ’ 


(37) 


it  can  be  noted  that  when  the  Mahalanobis  distance  between  Xt  and  the  cluster  mean,  m,, 
is  small  and  between  Xt  and  the  other  means  is  large,  the  a  posteriori  probability  is  high 
(approaching  the  upper  limit  which  is  one).  Using  the  approximation  that 


P('iOi|a:t,0)  «  < 


i  =  arg  [  mm  DM{xt,mj)] 
j 

otherwise. 


i  =  l,...,c 


(38) 


then  Equations  (33)-(35)  become: 


Pi  = 


i  E  1  =  -. 

n  n 


leWi 


(39) 


where  Uj  is  the  number  of  data  points  in  the  cluster.  The  component  density  means 
become: 


nii  = 


and  the  component  density  covariance  matrices  are: 


Q  =  —  13  “  mi){x  -  rriiY  (41) 

xeHi 

which  are  the  same  equations  used  to  calculate  the  weights,  the  means,  and  the  covariance 
matrices  in  WMD. 

This  shows  that  the  WMD  method  is  actually  an  approximate  method  of  finding  the 
actual  distribution  of  the  data  set  as  a  mixture  of  Gaussian  distributions.  Its  theoretical 
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Feature  1 

Figure  7.  The  four  different  Gaussians  detected  using  the  GMM/EM  approach  using 
initial  parameters  estimated  from  applying  the  LEG  algorithm  with  Euclidean 
distance 

justification  is  much  stronger  than  using  only  the  Euclidean  distance  as  shown  in  Eqn.  (36). 
It  is  a  faster  and  much  simpler  technique  to  perform  than  the  Expectation  Maximization 
of  Baum  and  yet  attains  accurate  results  as  will  be  seen  in  Chapter  IV. 

One  important  issue  in  the  GMM/EM  approach  is  the  need  for  an  initial  set  of 
parameters.  One  way  to  come  up  with  the  initial  model  is  to  run  a  clustering  algorithm 
on  the  data  and  estimate  the  mean,  covariance  matrix,  and  prior  for  each  component  of 
the  mixture.  In  fact  GMM/EM  was  not  able  to  come  up  with  the  correct  four  Gaussian 
clusters  of  data  in  Fig.  7  because  it  used  LEG  clustering  with  Euclidean  distance  to  find 
the  initial  partition  from  which  the  initial  model,  ©®,  was  estimated.  The  circles  in  the 
Fig.  7  are  the  equi-Euclidean  distance  from  the  estimated  means  of  the  Gaussians.  It  is 
clear  that  it  made  the  two  fat  clusters  into  one  elongated  Gaussian  cluster.  WMD  does 
not  have  that  problem  because  it  starts  from  the  raw  data. 

It  is  worthwhile  to  mention  that  the  EM  update  equations  reappeared  in  the  literature 
as  an  optimal  fuzzy  clustering  [14],  where  the  authors  defined  the  quantity  — ^  ^  as  an 
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“exponential”  distance  measure  dl{xt,mi)  and  the  component  covariance  matrices  were 
called  fuzzy  covariance  matrices.  In  this  paper,  Gath  and  Geva  realized  that  the  initial 
model  selection  has  an  important  effect  on  the  overall  results.  Therefore,  they  grew  the 
codebook  to  the  desired  number  of  codewords,  one  codeword  at  a  time,  by  a  procedure  that 
places  the  new  codeword  in  a  region  where  the  data  points  have  low  degree  of  membership 
in  the  existing  clusters.  This  is  similar  to  the  way  splitting  in  the  direction  of  maximum 
variance  would  ensure  systematic  distribution  of  the  codewords. 

The  optimal  fuzzy  approach  gave  very  good  results  that  captured  the  shape  of  un¬ 
derlying  substructures  in  many  cases  including  the  Iris  data  set,  linear  distributions,  and 
clustering  of  sleep  EEG  recordings  in  order  to  classify  the  signal  into  various  sleep  stages. 
However,  the  required  computation  associated  with  calculating  the  a  posteriori  probabili¬ 
ties  for  each  pattern  after  every  iteration  can  become  exorbitant. 

2. 7  Summary 

This  chapter  has  provided  a  background  on  clustering  and  has  presented  the  theo¬ 
retical  justification  for  the  approaches  used  in  this  thesis.  The  key  ideas  are  summarized 
as  follows: 

1.  Some  clusters  fall  naturally  in  hyperellipsoids  in  the  feature  space  rather  than  in 
hyperspheres. 

2.  The  WMD  algorithm  allows  for  individually  shaped  hyperellipsoidal  clusters  with 
constraints  on  the  covariance  matrices. 

3.  The  WMD  algorithm  is  related  to  the  expectation  maximization  algorithm. 
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III.  Methods 


3. 1  Introduction 

Having  discussed  in  Chapter  II  the  concepts  used  in  the  four  clustering  techniques 
(LBG,  FCM,  WMD,  and  HEC),  this  chapter  presents  the  details  of  implementing  these 
methods.  First,  Section  3.2  discusses  the  general  approach  followed  in  batch  fc-means-type 
clustering  with  variations  in  all  stages  of  the  algorithm.  Next,  Section  3.3  describes  the 
implementation  of  the  LBG  algorithm.  Next,  Section  3.4  describes  how  the  FCM  ap¬ 
proach  was  implemented.  In  Section  3.5,  the  details  of  the  weighted  Mahalanobis  distance 
algorithm  are  presented.  The  first  three  methods  are  described  together  because  they 
are  similar  in  following  the  general  structure  of  the  fc-means  algorithm,  so  they  may  be 
implemented  together  to  realize  significant  savings  in  computation  and  time.  Section  3.6 
outlines  the  methodology  used  in  building  the  neural  network  for  performing  the  principal 
component  analysis.  Finally,  Section  3.7  describes  the  HEC  network  for  integrating  the 
Mahalanobis  distance  in  partitional  clustering. 

3.2  A  General  Approach  for  Batch  k-means  Clustering 

As  discussed  in  Chapter  //,  the  LBG,  FCM,  and  WMD  techniques  are  all  based  on 
the  generalized  Lloyd  algorithm.  This  section  presents  the  general  approach  taken  by  these 
algorithms,  which  is  shown  in  Fig.  3.2. 

After  we  acquire  the  data  and  preprocess  it,  an  initial  partition  is  needed  before 
the  iterative  process  starts.  Each  clustering  epoch  consists  primarily  of  the  following 


Figure  8.  Flowchart  of  the  general  approach  to  perform  batch  fc-means  clustering 
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steps:  partitioning  according  to  a  specific  distortion  measure,  codebook  update  based  on 
the  results  of  the  first  step,  and  finally,  evaluation  of  the  current  partition  quality  before 
another  iteration  is  initiated. 

The  algorithms  described  here  are  implemented  in  MATLAB®^.  Appendix  A  contains 
the  code  named  CLUSTER  used  to  implement  the  LBG,  FCM,  and  WMD  techniques.  A 
description  of  its  operation  and  a  list  of  its  subroutines  are  also  provided.  The  following 
sections  will  describe  each  stage  of  this  general  code. 

3.2.1  Preprocessing.  A  normalization  procedure  that  is  composed  of  two  steps 
is  performed  for  all  cases.  First,  the  mean  of  the  data  set  is  subtracted  from  it  which 
is  equivalent  to  forcing  the  mean  of  the  input  patterns  to  equal  zero.  This  is  especially 
important  for  the  PCA  network.  Second,  we  scale  all  the  data  sets  to  make  the  maximum 
length  of  feature  vectors  to  be  one  without  changing  the  shape  of  clusters  or  the  relative 
location  or  ratio  between  the  features.  By  doing  so,  the  same  set  of  parameters  can  be 
applied  to  all  the  data  sets,  and  any  initial  codebook  with  small  length  is  guaranteed  to 
be  in  the  middle  of  the  data  points  in  the  feature  space. 

3.2.2  Initialization.  As  stated  in  Section  2.6.4,  the  choice  of  the  initial  model 
is  critical  for  the  quality  of  the  results.  The  parameters  needed  to  start  each  epoch  of 
the  iterative  process  depend  on  the  clustering  algorithm  under  study.  Nevertheless,  all 
clustering  techniques  require  an  initial  codebook  to  improve.  Therefore,  this  section  will 
describe  the  different  ways  of  finding  the  initial  codebook  while  initialization  for  different 
parameters  required  for  the  Mahalanobis  distance-based  algorithms  is  discussed  later. 

A  variety  of  codebook  initialization  schemes  have  been  investigated  in  hopes  of  pro¬ 
viding  accelerated  convergence  of  the  fc-means  algorithm,  achieving  a  better  local  miniTmim, 
and  providing  flexibility  in  terms  of  the  number  of  clusters.  In  the  following  sections,  some 
of  the  techniques  that  are  incorporated  in  CLUSTER  are  described. 

3. 2. 2.1  Initialization  by  Splitting.  Linde  et  al.  [24]  suggested  a  splitting 
method  whereby  one  starts  by  finding  the  centroid  of  all  the  training  data  which  is  the 
optimum  codebook  of  size  one.  This  single  codeword,  rrio,  is  then  split  to  form  two 
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codewords  as  follows: 


(42) 


wi,2  =  mo±p, 

where  p  is  a  small  perturbation  vector.  The  clustering  algorithm  is  then  run  to  get  a  good 
codebook  of  size  two.  The  design  continues  in  this  way  in  stages;  the  final  code  of  one 
stage  is  split  to  form  an  initial  code  for  the  next. 

Sometimes,  the  splitting  technique  is  desired  to  grow  the  codebook,  but  the  number 
of  clusters,  k,  is  not  an  integer  power  of  two.  For  such  cases,  a  modification  of  the  method 
is  implemented  wherein  the  codebook  is  grown  by  splitting  up  to  the  largest  N  codewords 
such  that, 

N  =  2^<k,  i  =  l,2,.... 

At  that  point,  the  codewords  are  sorted  in  descending  order  according  the  population  of 
each  cluster  and  only  the  first  {k  —  N)  clusters  are  split. 

3. 2. 2. 2  Fixed  Codebook  Size  Initialization.  Starting  with  a  codebook  of  the 
right  size  has  the  advantage  that  the  computations  associated  with  growing  a  full-sized 
codebook  by  splitting  are  eliminated.  Thus,  the  design  takes  less  time  to  complete  and 
also  allows  an  arbitrary  number  of  codewords  in  the  final  codebook. 

Perhaps  the  simplest  fixed  codebook  size  initialization  technique  that  is  used  in  the 
fc-means  algorithm  [33]  is  to  use  the  first  k  vectors  in  the  training  sequence  as  the  initial 
prototypes.  Another  simple  approach  is  to  generate  small  random  vectors  as  the  initial 
codebook.  These  initialization  methods  are  inefficient  since  some  of  the  initial  codewords 
may  lie  in  a  region  with  no  patterns,  and  thus  result  in  some  unused  codewords. 

Bezdek  et  al.  [5]  proposed  another  initialization  method  that  disperses  initial  proto¬ 
types  uniformly  along  each  feature  range  starting  from  the  point  Vi  =  {mi,  m2, . . .  ,md)^ 
to  the  point  m*,  =  {Mi,  M2, . .. ,  MdY ,  where  ruj  and  are  the  minimum  and  the  maxi¬ 
mum  values  of  the  feature  for  all  the  data  points,  respectively.  This  method  is  referred 
to  as  gridding  in  CLUSTER. 

Katsavounidis  et  al  [21]  suggested  that  widely  separated  data  points  are  likely  to 
belong  to  different  classes.  Hence,  they  proposed  a  Maximin  initialization  algorithm  based 
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on  the  following  steps:  First,  calculate  the  norms  of  all  vectors  in  the  data  set.  Choose  the 
vector  with  the  maximum  norm  as  the  first  codeword.  Then,  with  a  codebook  of  size  i,  we 
compute  the  distance  between  any  remaining  training  vector  Xj  and  all  existing  codewords 
and  call  the  smallest  value  the  distance  between  Xj  and  the  codebook.  Then,  the  training 
vector  with  the  largest  distance  from  the  codebook  is  chosen  to  be  the  (i  +  1)*^  codeword. 
The  procedure  stops  when  we  obtain  a  codebook  of  size  k. 

DeSimio  [11]  used  the  Karhunen-Loeve  (KL)  transform  to  position  initial  codewords 
along  the  directions  of  maximum  variance.  The  number  of  codewords  in  these  directions 
are  chosen  proportional  to  the  variance  in  that  direction.  Thus,  for  directions  where  the 
data  have  relatively  large  variances,  more  codewords  are  used  than  for  directions  where 
the  variances  are  small. 

The  initial  codebook  generation  method  begins  by  specifying  the  number  of  code¬ 
words  to  be  developed.  The  covariance  matrix,  W,  of  the  d-dimensional  training  data 
is  then  computed  along  with  the  associated  eigenvalues,  i/’j)  =tnd  eigenvectors,  e^,  for 
i  =  1, . . .  ,d. 

The  number  of  codewords  along  each  eigenvector  direction,  ki,  is  found  according  to 


V’i 


-i-0.5 


(43) 


where  k  is  the  desired  codebook  size.  Codewords  are  formed  by  scaling  the  eigenvectors 
such  that  the  ki  codewords  are  uniformly  distributed  over  ±2(7^.  Where  ai  is  the  standard 
deviation  in  the  direction  of  the  eigenvector. 

To  demonstrate  the  effect  of  initialization  on  the  overall  performance  of  a  clustering 
algorithm,  the  LBG  algorithm  with  Euclidean  distortion  measure  was  used  to  partition 
the  Daisy  data  set,  shown  in  Fig.  9,  into  eight  clusters.  Table  1  shows  the  initial  sum  of 
squared  distortion  using  the  initial  codebook  generated  from  the  different  techniques,  the 
final  distortion  after  the  LBG  algorithm  converged,  and  the  number  of  iterations  required 
for  convergence. 
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Scatter  plot  for  Daisy  data  set 


Figure  9.  Scatter  plot  for  the  Daisy  data  set 


Table  1. 


Table  of  the  required  initial  and  final  distortion  and  the  number  of  iterations 
required  by  the  LBG  algorithm  to  cluster  the  Daisy  data  set  into  eight  clusters 


Initialization  Method 

Initial  Distortion 

Final  Distortion 

Number  of  Iterations 

Splitting 

21550 

15880 

17 

First- A:-samples 

295700 

26430 

7 

Maximin 

79710 

15960 

12 

Gridding 

97890 

14690 

8 

KLI 

41860 

13610 

7 
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As  can  be  seen  from  the  results,  the  worst  is  the  “First-fc-samples”  approach  because 
all  the  initial  codewords  may  be  of  the  same  cluster.  Splitting  took  more  iterations  to  give 
better  initial  distortion,  but  it  is  not  guaranteed  to  give  the  best  final  distortion. 

For  the  purpose  of  simulations  in  the  next  chapter,  LBG’s  splitting  was  used.  For  the 
cases  where  the  required  codebook  size  is  not  an  integer  power  of  two,  the  KL  initialization 
(KLI)  technique  was  preferred  to  the  other  techniques  as  it  gives  the  best  overall  results 
as  shown  in  Table  1. 

3.2.3  Partitioning.  The  partitioning  step  consists  basically  of  assigning  each  data 
point  to  the  cluster  whose  centroid  is  nearest  to  that  data  point  in  d  space,  where  d  is  the 
dimensionality  of  the  data  set.  The  minimum  distance  criterion  can  be  used  to  determine 
the  point  membership  of  each  centroid. 

Hence,  partitioning  of  X  into  k  clusters  is  better  described  by  a  n  x  k  matrix,  U, 
whose  entry,  Uij,  is  the  membership  of  the  i*'*  pattern  in  the  cluster.  Therefore,  the 
entries  of  U  are  numbers  between  zero  and  one  and  the  sum  of  the  entries  in  each  row  is 
one. 

Note  that  in  hard  clustering  the  range  of  Uij  is  {0,1},  which  means  that  in  cases  of 
equal  distances  to  more  than  one  cluster  center  a  tie-breaking  rule  is  needed.  In  CLUSTER, 
the  least  numbered  cluster  is  considered  a  winner. 

3.2.4  Codebook  Update.  There  are  two  ways  for  updating  cluster  centers,  the 
sequential  approach,  where  the  codewords  are  updated  after  each  sample,  and  the  batch 
method,  where  the  update  takes  place  after  all  the  data  points  are  assigned  to  the  different 
cluster  centers.  Even  though  the  HEC  algorithm  adjusts  the  weights  adaptively  to  estimate 
the  eigenvectors  and  eigenvalues,  the  actual  update  to  the  cluster  centers  are  done  by  a 
batch  vector  quantization  algorithm  [25].  Therefore,  we  restrict  our  analysis  in  this  thesis 
to  the  batch  mode. 

3.2.5  Clustering  Evaluation.  The  final  logical  step  within  an  iteration  is  the 
evaluation  of  the  clusters  produced  by  the  previous  steps.  Once  formed,  clusters  must  be 
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evaluated  to  determine  if  the  present  configuration  is  optimal  or  whether  modifications  are 
necessary. 

The  following  sections  discuss  in  detail  the  implementation  of  each  of  the  three  clus¬ 
tering  algorithms.  For  each  technique,  the  specifics  about  which  initialization  technique 
is  used,  what  distance  measure  is  implemented,  how  to  perform  the  update,  and  what 
criterion  for  cluster  evaluation  is  necessary  are  clearly  identified. 

3.3  The  LBG  Algorithm  Implementation 

We  followed  the  algorithm  described  in  [24]  using  the  Euclidean  distance  as  a  ba¬ 
sis  for  partitioning.  The  codebook  was  grown  by  splitting,  where  a  fixed  small  random 
perturbation  vector  is  added  to  each  codeword  after  each  splitting  stage. 

The  n  X  c  distance  matrix  containing  the  Euclidean  distance  from  the  data  points  to 
each  codeword  is  calculated  using  Eqn.  (6).  The  optimality  criterion  is  the  mean  squared 
error  (MSE).  The  algorithm  is  terminated  when  (MSE)  stops  decreasing  significantly. 

3.4  The  Fuzzy  c-means  Algorithm  Implementation 

There  is  no  specific  initialization  technique  suggested  for  the  FCM  algorithm  [7]. 
Bezdek  suggested  starting  with  a  random  codebook  but  for  the  purpose  of  comparison, 
the  same  method  for  initialization  was  used  for  all  algorithms.  The  Euclidean  distance 
was  also  used  as  a  basis  for  partitioning  to  compare  with  the  Mahalanobis  distance-based 
algorithms. 

While  in  LBG  only  data  points  which  are  assigned  to  a  specific  cluster  will  update 
its  codeword,  in  FCM  all  the  data  points  will  affect  the  update  of  the  codewords.  In  other 
words,  every  data  point  is  assigned  to  all  the  codewords  but  with  different  degrees.  Recall 
that  if  u  G{0,1}  the  calculation  of  M  reduces  to  computing  the  means  for  the  different 
clusters.  Hence  this  same  equation,  Eqn.  (8),  is  used  for  updating  the  cluster  center  for 
both  the  LBG  and  the  FCM  algorithms. 

Different  values  for  the  exponential  weight,  m,  were  tried.  It  was  found  that  as  m  gets 
larger,  the  effect  of  far  points  on  the  cluster  center  becomes  clearer  by  shifting  its  location 
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towards  them.  Keller  et  al.  found  that,  in  general,  a  value  of  m  =  2  is  appropriate  [22]. 
However,  a  value  of  m  =  3  was  used  for  the  simulations  to  allow  for  more  fuzziness. 

When  using  the  fuzzy  membership  functions  to  perform  classification,  one  considers 
the  degree  of  class  membership  for  each  test  vector.  The  class  membership  function  values, 
U,  computed  with  Eqn.  (9),  provide  a  measure  of  similarity  by  which  a  decision  can  be 
made  to  assign  the  pattern  to  the  class  that  yields  the  highest  membership  value. 


3.5  The  WMD  Algorithm  Implemetation 

The  algorithm  starts  with  measuring  the  mean  and  covariance  of  the  whole  data  set. 
As  suggested  by  [24]  the  required  c  codewords  can  grow  by  splitting.  Splitting  can  be  done 
randomly  or  according  to  other  considerations.  However,  we  think  that  the  best  splitting 
is  that  which  will  put  the  new  codewords  for  a  specific  cluster  far  apart  in  the  direction  of 
the  maximum  variance  of  that  cluster.  We  made  the  perturbation  for  each  cluster  based 
on  eigenvalues  and  eigenvectors  of  its  covariance  matrix  as 


Pi  = 


(44) 


where  and  are  the  maximum  eigenvalue  and  the  corresponding  eigenvector  of  Q, 
respectively.  The  benefit  of  this  approach  is  that  there  is  a  systematic  approach  of  dis¬ 
tributing  the  codebook  vectors  and  each  cluster  will  split  in  different  directions  based  on 
the  distribution  in  that  cluster  independent  from  other  clusters. 

In  cases  where  the  number  of  codewords  is  not  an  integer  power  of  two,  the  KLI 
algorithm  was  used  with  the  covariance  matrix  for  all  data  points  is  used  as  Q  for  all 
clusters. 

Two  ways  to  compute  the  Mahalanobis  distance  are  possible,  using  as  in  Eqn.  (4) 
or  using  the  eigenvectors  and  eigenvalues  of  Cj  as  suggested  by  Eqn.  (17).  The  second 
method  avoids  the  need  to  compute  the  distance  to  each  data  point  separately  so  matrix 
multiplication  can  be  used  instead  of  the  loops  which  results  in  huge  savings.  Moreover, 
the  second  method  turns  away  from  inverting  the  different  covariance  matrices  after  each 
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iterations  and  with  proper  implementation,  we  can  keep  the  eigenvalues  and  eigenvectors 
so  we  can  also  obtain  the  new  codewords  after  each  splitting  stage  as  the  byproduct. 

After  all  the  data  is  presented,  the  codeword  for  each  cluster  will  be  computed  as 
the  average  of  the  data  points  assigned  to  that  cluster.  The  covariance  matrix  of  these 
data  points  forms  the  A  matrix  for  that  group.  The  weight,  Wi,  can  be  calculated  as  the 
ratio  of  the  number  of  samples  in  the  z**  cluster  to  the  size  of  the  training  samples  for  the 
WMD  with  conscience  (WMDC).  For  the  WMD  with  volume  constraint  (WMDVC) 
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where  |Cj|  is  the  determinant  of  Cj. 

The  algorithm  is  terminated  when  the  difference  between  the  location  of  the  code¬ 
words  in  consecutive  iterations  falls  below  a  threshold  or  Jgmse  stops  decreasing  signifi¬ 
cantly  for  the  case  of  WMDVC. 


3.6  Principal  Component  Analysis  Network 

The  proposed  neural  network  was  built  as  shown  in  Fig.  5.  In  using  the  update 
equations,  Eqn.  (20)  and  Eqn.  (21),  it  should  be  noted  that  it  is  required  to  have  the  mean 
of  the  input  patterns  equal  zero.  In  addition,  the  weights  should  be  normalized  after  every 
update. 

The  weight  vector  Wi  converges  to  the  eigenvector  with  the  largest  eigenvalue  of  the 
covariance  matrix  C  of  the  pattern  set.  Subsequently,  the  eigenvector  corresponding  to 
the  next  largest  eigenvalue  is  computed  and  so  on  up  the  number  of  the  output  nodes. 

This  network  was  able  to  find  the  eigenvectors  for  the  covariance  matrix  of  several 
data  sets  accurately  and  quickly.  However,  it  was  noticed  that  this  algorithm  doesn’t 
converge  easily  when  the  eigenvalues  are  small.  A  modification  of  the  learning  rules  was 
implemented  by  Mao  and  Jain  [25].  However,  since  the  authors  didn’t  have  theoretical 
justification  and  introduced  ad  hoc  parameters  that  might  not  generalize  to  other  data 
sets,  the  modified  learning  rules  were  not  implemented. 
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Figure  10.  Flowchart  of  the  HEC  algorithm 


5.7  Hyperellipsoidal  Clustering  Algorithm 

Since  the  PCA  network  didn’t  converge  easily  especially  when  the  eigenvalues  were 
small  and  since  there  are  exact  methods  for  computing  the  eigenvectors  and  eigenvalues  of 
a  matrix,  the  same  algorithm  was  implemented  but  using  MATLAB’s  eig  function  to  find 
the  eigenvectors  and  eigenvalues  after  each  iteration.  The  algorithm  is  depicted  in  Fig.  10. 

For  initialization,  HEC  starts  with  the  identity  covariance  matrices  for  all  clusters 
and  random  cluster  centers.  The  value  of  A  starts  with  unity  and  it  decreases  by  a  specific 
amount,  AA  until  it  reaches  a  specified  minimum  value,  Ap.  AA  and  Xg  are  chosen  to  be 
0.2  and  zero,  respectively,  similar  to  what  was  used  in  [25].  The  value  of  e  on  the  other 
hand  is  chosen  to  be  10~®. 

An  important  characteristic  of  the  HEC  algorithm  is  the  inner  loop  where  the  cluster 
centers  may  change  location  without  updating  the  covariance  matrix  (shape  of  the  equi- 
Mahalanobis  distance  ellipse).  This  was  shown  to  result  in  saving  in  computations  as  it 
avoids  the  need  to  recalculate  the  covariance  matrix  and  it  also  provided  some  improvement 
in  the  results. 

The  stopping  criterion  used  in  both  loops  is  whether  or  not  there  is  a  change  in  the 
partitioning  as  long  as  the  total  number  of  iteration  is  below  a  threshold. 

3.8  Summary 


This  chapter  has  presented  the  methodology  used  in  each  of  the  LBG,  FCM,  WMD, 
and  HEC  algorithms  for  clustering.  The  general  structure  for  a  batch  fc-means  was  in- 


troduced  with  a  description  of  each  stage  of  this  general  algorithm.  A  neural  network 
for  principal  component  analysis  was  discussed.  The  following  chapter  presents  results 
obtained  by  applying  these  techniques  to  several  pattern  recognition  and  Image  coding 
problems. 
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IV.  Results 


4.1  Introduction 

In  this  chapter,  we  investigate  the  performance  of  the  WMD  algorithm  on  several 
data  sets.  Comparisons  are  made  with  the  LBG  algorithm  of  Linde  et  al.  [24]  and  the 
fuzzy  c-means  (FCM)  as  outlined  in  Bezdek  [7],  both  of  which  use  Euclidean  distance.  The 
algorithm  for  hyperellipsoidal  clustering  (HEC)  [25]  which  uses  a  regularized  Mahalanobis 
distance  is  also  compared. 

Each  of  the  four  cluster  analysis  techniques  described  in  the  previous  chapter  is 
applied  to  several  test  cases.  The  first  set  of  tests  utilizes  two-dimensional  Gaussian  dis¬ 
tributions,  which  are  easy  to  both  visualize  and  analyze.  The  results  of  these  tests  is 
described  in  Section  4.2.  Section  4.3  shows  the  results  of  applying  the  different  algorithms 
on  four-dimensional  measured  samples  of  the  Iris  flower.  These  sets  of  cases  are  the  same 
as  those  used  by  Mao  and  Jain  in  their  paper  [25].  Some  of  these  test  cases  are  generated 
by  taking  samples  from  known  classes  so  that  the  true  error  rate  may  be  computed  to 
check  the  accuracy  of  the  classification  provided  by  each  technique.  The  performance  of 
these  algorithms  on  image  coding  vector  quantization  is  also  studied  in  Section  4.4  as  a 
practical  application. 

In  all  cases,  the  performance  metric  is  not  minimum  mean  squared  error  (MSE).  It  is 
well  known  that  MSE  does  not  imply  highest  classification  accuracy  or  highest  perceived 
visual  quality  [12].  Indeed,  in  all  cases,  WMD  codebooks  resulted  in  larger  MSE  than 
the  Euclidean-based  codebooks.  However,  WMD  clustering,  classification  and  perceptual 
quality  performance  are  shown  to  be  superior  to  the  competing  methods. 

For  the  cases  where  the  codebook  size  is  not  an  integer  power  of  two,  clusters  of  the 
training  samples  are  formed  using  the  Karhunen-Loeve  initialization  [11].  The  Karhunen- 
Loeve  initialization  was  chosen  to  initialize  the  codebooks  because  it  disperses  the  desired 
number  of  codewords  along  the  principal  component  axes  of  the  data’s  covariance  matrix, 
allowing  simple  and  efficient  implementation  of  any  number  of  codewords. 
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Table  2.  Results  of  clustering  the  2-dimensional  mixture  of  Gaussians  data  into  three 
clusters  by  the  different  clustering  methods 


Clustering  Algorithm 

7^  of  Errors 

#  of  Iterations 

Megaflops 

LEG 

231 

5 

2.89 

FCM 

204 

20 

4.67 

WMD  (with  conscience) 

3 

8 

3.29 

WMD  (volume  constraint) 

5 

14 

3.81 

HEC 

3 

20 

4.37 

4.2  Two-Dimensional  Test  Cases 

The  WMD  algorithm  was  used  to  design  a  quantizer  for  2-dimensional  data  with  four 
Gaussian  distributed  clusters,  see  Fig.  11(a). 

LEG  clustered  the  data  into  four  clusters  as  shown  in  Fig.  11(b).  On  the  other  hand, 
FCM  did  converge  to  the  desired  four  clusters  after  15  iterations  as  shown  in  Fig.  11(c). 
The  WMD  algorithm  grouped  the  data  points  as  shown  in  Fig.  11(d)  after  four  iterations 
only.  The  large  circles  in  Fig.  11(b)  and  11(c)  show  the  equi-Euclidean  distance  points 
from  the  means.  The  equi-Mahalanobis  distance  ellipses  are  shown  in  Fig.  11(d)  to  follow 
the  shape  of  the  actual  clusters.  This  demonstrates  the  natural  grouping  of  WMD. 

The  WMD  algorithm  was  also  used  to  design  a  quantizer  for  2-dimensional  data 
with  three  Gaussian  distributed  clusters,  see  Fig.  12(a).  LEG  divided  the  data  into  three 
classes  after  5  iterations  as  shown  in  Fig.  12(c).  FCM  classified  the  data  points  as  shown 
in  Fig.  12(d)  in  20  iterations.  The  WMD  with  conscience  (WMDC)  algorithm  converged 
in  8  iterations  to  the  clusters  shown  in  Fig.  12(e)  with  only  3  misclassified  samples  as 
compared  to  the  actual  distributions  that  generated  the  data.  This  is  similar  to  the  results 
of  HEC  algorithm  but  with  fewer  iterations  and  less  computations.  WMD  with  volume 
constraint  (WMDVC),  on  the  other  hand,  resulted  in  5  errors  after  14  iterations.  The 
large  circles  in  Fig.  12(a)  and  Fig.  12(b)  show  the  equi-Euclidean  distance  points  from  the 
cluster  centers.  The  equi-WMD  ellipses  are  shown  in  Fig.  12(c);  note  that  they  follow  the 
shape  of  the  actual  clusters.  We  can  see  that  WMD  outperforms  both  LEG  and  FCM 
and  would  similarly  outperform  any  clustering  technique  that  uses  Euclidean  distance. 
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Feature  2 


Figure  12.  Original  and  computed  partitioning  of  a  mixture  of  Gaussian  data  using  the 
different  clustering  methods,  (a)  Original  classes,  (b)-(f):  Resultant  One- 
nearest  neighbor  classification  using  the  codebook  generated  by  HEC,  LEG, 
FCM,  WMDC,  and  WMDVC  methods,  respectively. 


Table  3.  Results  of  clustering  the  Iris  data  into  three  clusters  by  the  different  clustering 
methods 


Clustering  Algorithm 

of  Errors 

#  of  Iterations 

Megaflops 

LEG 

16 

6 

0.58 

FCM 

15 

12 

1.53 

WMD  (with  conscience) 

5 

7 

0.81 

WMD  (volume  constraint) 

7 

13 

1.41 

HEC 

5 

21 

2.03 

Furthermore,  Mahalanobis  distance  degenerates  to  Euclidean  distance  for  hyperspherical 
clusters  as  shown  by  the  circular  cluster  in  Fig.  12(e)  and  Fig.  12(f). 

4.3  Clustering  of  The  Iris  Data 

We  also  applied  WMD  to  the  well-known  Iris  data  set.  This  data  set  has  four  features 
representing  three  different  species  of  the  Iris  flower.  Figure  13(a)  shows  the  projection  of 
the  Iris  data  on  the  two  main  principal  components.  Figures  13(b)-(f)  show  the  results  of 
minimum  distance  classification  of  the  samples  using  the  different  techniques.  Note  that 
the  projection  of  the  Iris  data  on  the  first  two  principal  components  should  give  us  a  good 
idea  about  the  shape  of  the  clusters  since  the  third  and  fourth  eigenvalues  are  very  small 
so  the  variation  along  those  eigenvectors  are  small. 

After  6  iterations,  LEG  resulted  in  misclassification  of  16  data  points.  FCM  resulted 
in  15  patterns  to  be  misclassified  after  12  iterations.  On  the  other  hand,  WMDC  resulted  in 
only  five  errors  after  7  iterations.  Using  the  WMDVC  resulted  in  7  errors  in  13  iterations. 
The  WMDC  result  is  consistent  with  the  results  of  HEC  [25]  but  achieved  with  our  much 
simpler  design. 

The  reason  for  this  improved  performance  of  the  Mahalanobis  distance-based  clus¬ 
tering  is  that  Iris  classes  are  not  well  separated,  and  they  do  not  occupy  a  hyperspherical 
shape  in  the  feature  space.  These  facts  are  easily  demonstrated  by  Fig.  13(a).  Comparing 
WMD  with  HEC,  WMD  eliminates  the  overhead  of  learning  the  network  weights  and  does 
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Second  Principal  Component  Second  Principal  Component  Second  Principal  Component 


First  Principal  Component  First  Principal  Component 

(e)  (f) 

Figure  13.  Original  and  computed  partitioning  of  Iris  data  using  the  different  clustering 
methods,  (a)  Original  classes,  (b)-(f):  Resultant  One-nearest  neighbor  clas¬ 
sification  using  the  codebook  generated  by  LEG,  FCM,  HEC  ,  WMD  with 
conscience,  and  WMD  with  volume  constraint  methods,  respectively. 
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not  require  an  upper  limit  on  the  number  of  iterations  (100,000  was  used  in  HEC)  [25]. 
Furthermore,  no  ad  hoc  parameters  are  required  for  WMD. 

4-4  Vector  Quantization  Coding  of  Image  Data 

In  this  section,  we  evaluate  the  WMD  clustering  algorithm  in  a  realistic  application, 
specifically,  vector  quantization  coding  of  an  image.  We  have  chosen  this  application, 
in  part,  because  the  computation  is  very  intensive  due  to  the  large  amount  of  data  to 
be  processed.  In  addition,  the  evaluation  results  can  be  expressed  both  numerically  and 
visually,  providing  better  insight  of  the  differences  among  the  algorithms. 

Vector  quantization  image  coding  is  used  to  reduce  transmission  bit  rate  or  data 
storage  while  maintaining  acceptable  image  quality.  In  this  application,  the  mandrill  image 
from  the  University  of  Southern  California  (USC)  image  database  was  used,  see  Fig.  14(a). 
It  consists  of  480  x  500  pixel  digitized  into  8-bit  gray  levels.  Due  to  the  large  size  of 
the  image,  it  was  down-sampled  by  a  factor  of  four  along  both  axes  resulting  in  a  more 
manageable  problem.  In  this  experiment  we  try  to  partition  the  images  into  4  clusters 
and  then  define  the  centroids  of  resulting  clusters  as  the  codewords.  Blocks  of  4  x  4  pixels 
defining  16-dimensional  vectors  were  then  used  to  design  the  codebooks. 

For  the  purpose  of  comparison,  we  test  the  performance  of  the  WMD  algorithm  ver¬ 
sus  both  LBG  clustering  and  Fuzzy  c-means  on  computing  the  codebooks  for  the  images. 
Figures  14(b),  14(c),  and  14(d)  show  the  reconstructed  images  using  the  four-class  code¬ 
book  generated  by  WMD,  FCM,  and  LBG  clustering  methods,  respectively.  Comparing 
these  figures,  we  find  obvious  degradation  of  the  visual  quality  in  LBG  and  FCM  recon¬ 
structed  images  especially  in  the  nose  area.  Only  the  WMD  image  shows  the  three-level 
change  in  the  nose  pixels  and  the  continuity  of  the  dark  gray  level  all  around  the  nose.  The 
covariance  matrix  of  the  pixels  corresponding  to  this  gray  level  was  found  to  be  different 
from  the  identity  which  implies  a  non-hyperspherical  shape.  WMD  groups  the  vectors 
naturally,  which  provides  the  clear  improvement  in  the  quality  of  the  reconstructed  image. 
The  proposed  algorithm  adapts  to  the  local  characteristics  and  preserves  the  features  of 
the  face. 


51 


tuTfr^  * 


Figure  14.  Original  and  reconstructed  Mandrill  images  using  the  different  clustering 
methods,  (a)  Original  Mandrill  image,  (b),  (c),  and  (d):  Reconstructed  im¬ 
age  using  the  four-class  codebook  generated  by  WMD,  FCM,  LEG  methods, 
respectively 
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The  Lena  image  from  the  USC  image  database  was  also  used,  see  Fig.  15(a).  It 
consists  of  480  x  512  pixel  digitized  into  8-bit  gray  levels.  We  used  blocks  of  4  x  4  pixels 
defining  16-dimensional  vectors  to  design  a  four-class  codebook. 

Figures  15(b),  15(c),  and  15(d)  show  the  reconstructed  images  using  WMD,  FCM, 
and  LBG  clustering  methods  respectively.  Comparing  these  figures,  we  find  obvious  degra¬ 
dation  of  the  visual  quality  around  the  shoulders  and  faces  in  LBG  and  FCM  cases.  Only 
the  WMD  image  shows  the  three-level  change  in  the  gray  level  in  the  face  and  the  right- 
hand  side  of  the  background.  WMD  groups  the  vectors  naturally  which  provides  the  clear 
improvement  in  the  quality  of  the  reconstructed  image. 

As  can  be  seen  in  Fig.  15(c),  it  is  possible  in  the  FCM  algorithm  to  have  very  close 
codewords  that  partition  the  space  based  on  almost  equal  membership  function  values. 
Hence  we  see  only  one  gray  level  and  two  white  levels. 

Trying  to  apply  the  HEC  algorithm  on  some  test  cases,  it  was  noted  that  one  or 
no  samples  are  assigned  to  the  centroid  of  a  specific  cluster.  Therefore,  it  is  difficult  to 
compute  a  covariance  matrix  for  that  cluster  and  the  algorithm  ends  up  with  less  than  k 
clusters. 

This  problem  can  be  attributed  to  two  factors:  First,  random  initialization  may  result 
in  cluster  centers  that  is  farther  away  from  the  samples  than  the  other  clusters.  Such  case 
was  encountered  when  HEC  was  applied  to  Daisy  data  set,  see  Fig.  9.  Even  though  the 
authors  didn’t  suggest  a  solution  to  this  problem,  this  problem  was  fixed  by  splitting  the 
cluster  that  has  the  largest  number  of  data  points.  This  solution  solved  the  problem  for 
the  Daisy  data  set  but  when  HEC  was  applied  on  the  image  vectors  it  failed  to  converge. 

The  failure  to  converge  on  the  image  coding  problem  and  other  cases  points  to  the 
second  possible  cause  of  this  behavior  which  is  the  problem  of  large  clusters  engulfing 
smaller  ones.  See  problem  three  in  Section  2.6.1.  It  is  clear  that  after  several  iterations, 
the  value  of  A  becomes  zero  and  the  regularized  Mahalanobis  distance  reduces  to  Purely 
Mahalanobis  distance  without  any  constraint  on  the  covariance  matrix.  The  fact  that  this 
problem  doesn’t  occur  for  the  WMD  method  reassures  the  need  for  a  constraint,  such  as  the 
conscience  or  the  volume  constraint,  when  using  the  Mahalanobis  distance  in  clustering. 
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(c)  (d) 

Figure  15.  Original  and  reconstructed  Lena  images  using  the  different  clnstering  methods. 

(a)  Original  Lena  image,  (b),  (c),  and  (d):  Reconstructed  image  using  the 
four-class  codebook  generated  by  WMD,  FCM,  LEG  methods,  respectively 
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J^.S  Summary 

This  chapter  has  shown  the  results  obtained  by  applying  each  of  the  four  techniques  to 
two  two-dimensional  and  one  four-dimensional  test  cases.  In  addition,  vector  quantization 
coding  of  two  images  were  intried.  In  general,  these  results  indicate  that  the  Mahalanobis 
distance  is  a  useful  metric  to  be  used  as  a  similarity  measure.  Nevertheless,  it  was  shown 
that  there  exists  a  necessity  to  control  the  covariance  matrix  of  each  cluster  in  order  to 
obtain  good  results.  The  next  chapter  summarizes  the  results  obtained  in  this  investigation 
and  provides  some  recommendations  for  further  research. 
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V.  Conclusion 


5. 1  Introduction 

The  primary  objective  of  this  research  was  to  examine  the  use  of  the  class-based 
Mahalanobis  distance  in  clustering  and  develop  a  method  to  solve  the  problems  associated 
with  such  a  use.  The  second  objective  was  to  compare  its  performance  to  Euclidean 
distance  based  algorithms  and  other  Mahalanobis  distance  based  algorithms.  Both  of  the 
objectives  were  met. 

5.2  Contributions 

•  This  thesis  introduced  a  method  for  integrating  the  Mahalanobis  distance  into  batch 
fc-means-type  algorithm.  This  algorithm  has  several  attributes  that  make  it  well 
suited  for  more  general  clustering  problems:  First,  the  system  is  robust  in  that 
it  achieved  similar  success  in  all  of  the  test  cases  studied.  Second,  the  algorithm 
is  related  to  the  expectation  maximization  process  so  it  has  the  same  theoretical 
foundation  as  the  Gaussian  mixture  models.  Third,  it  builds  the  codebook  starting 
from  the  data  itself  using  an  improved  splitting  method  that  takes  the  different  shapes 
of  clusters  into  account. 

•  The  analysis  showed  that  the  covariance  matrices  in  the  Mahalanobis  distance  based 
algorithms  need  to  be  constrained  in  order  to  reach  a  non-trivial  solution.  Two 
different  constraints  were  discussed  and  implemented. 

•  As  a  tool  for  follow-on  research,  this  thesis  intentionally  provides  a  significant  amount 
of  background  information  in  the  area  of  clustering.  Chapter  II  was  specifically 
written  to  assist  a  novice  to  the  field  in  learning  many  of  the  clustering  problem 
fundamentals.  The  software  developed  in  this  effort  contains  numerous  selections  of 
clustering  techniques  and  parameters.  This  provides  ease  in  application  or  replication 
of  this  work  for  further  research. 
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5.3  Follow-on  Research 


The  research  discussed  in  this  thesis  is  by  no  means  exhaustive.  As  with  any  large 
undertaking,  there  are  many  areas  left  for  further  research.  Some  possible  enhancements 
are  listed  below: 

•  Preliminary  results  showed  that  the  mutual  Mahalanobis  distances  between  the  clus¬ 
ter  centers  rrij  and  rrij  using  both  and  Cj  are  good  indicators  of  the  similarity 
between  the  two  clusters  in  terms  of  shape  and  orientation.  These  indicators  can  be 
the  basis  for  merging  similar  clusters. 

•  Exploring  the  possibilities  for  finding  the  appropriate  number  of  clusters,  k,  by  means 
of  merging  clusters  based  on  the  Mahalanobis  distance  between  their  centroids  and 
finding  a  monotonically  decreasing  criterion  function  as  the  number  of  clusters  k 
increases,  such  as  Jqmse  in  WMD  with  volume  constraint.  The  best  number  of 
clusters  will  correspond  to  the  “knee”  of  the  curve  where  further  increase  in  k  results 
in  only  a  slight  decrease  in  the  criterion  function. 

•  Developing  a  way  to  handle  outliers  which  affect  the  covariance  matrix  and  hence 
the  overall  performance  of  a  Mahalanobis  distance  based  algorithm. 

5.4  Summary  of  Results 

Mahalanobis  Distance  is  a  very  useful  tool  which  is  believed  to  give  a  better  measure 
of  similarity  than  the  Euclidean  distance.  It  was  shown  that  for  certain  data  sets,  the 
Euclidean  distance  based  clustering  algorithms  do  not  group  the  data  points  naturally. 
The  weighted  Mahalanobis  distance  clustering,  on  the  other  hand,  results  in  codebooks 
that  represent  the  data  better  than  codebooks  generated  with  either  the  LEG  or  the  FCM 
algorithms. 

Even  though  the  neural  network  for  hyperellipsoidal  clustering  (HEC)  uses  the  Ma¬ 
halanobis  distance,  it  failed  to  perform  as  well  as  the  WMD  algorithm.  This  is  because  it 
could  not  solve  the  problems  associated  with  using  the  Mahalanobis  distance  in  clustering. 

The  WMD  algorithm  has  been  demonstrated  using  both  artificial  data  and  real  data 
to  outperform  existing  methods  in  maintaining  the  natural  distribution  of  the  feature  space. 
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It  was  also  shown  that  WMD  clustering  accounts  for  variability  in  cluster  shapes,  cluster 
densities,  and  the  number  of  data  points  in  each  of  the  subsets. 

The  techniques  developed  for  this  thesis  should  be  applicable  not  only  to  other  syn¬ 
thetic  data,  but  to  any  data  sets  which  are  characterized  by  non-spherical  cluster  shapes. 

5.5  Conclusions 

The  results  obtained  in  this  thesis  lead  to  several  interesting  conclusions: 

•  Hyperellipsoidal  clusters  are  more  general  and  more  likely  to  be  encountered  in  real- 
life  applications  than  hyperspherical  clusters. 

•  Increased  generality  allows  better  classification  and  perceptual  image  quality  than 
Euclidean  distance  based  algorithms. 

•  Unlike  HEC,  WMD  was  able  to  overcome  the  difficulties  associated  with  using  the 
Mahanobis  disance  in  clustering. 

•  WMD  has  been  related  to  the  EM  algorithm  for  finding  the  parameters  in  a  Gaussian 
Mixture  Model 
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Appendix  A.  Clustering  Codes 

(R) 

This  appendix  contains  the  MATLAB^  codes  which  implement  the  LBG,  FCM, 
WMD,  and  HEC  approaches. 

To  allow  for  saving  in  time  and  computation,  a  system  for  clustering  is  generated  with 
numerous  selections  of  data  sets,  codebook  size,  distance  metrics,  initialization  techniques, 
and  partitioning  rules  (hard  or  fuzzy) .  Therefore,  a  user  can  study  the  results  of  using  these 
algorithms  with  different  parameters.  In  combining  the  algorithms  together  in  CLUSTER, 
some  redundant  operations  are  eliminated,  such  as  reloading  the  data,  normalization,  and 
codebook  initialization.  In  fact,  it  is  possible  to  have  the  code  run  all  the  algorithms  with 
all  different  choices  or  parameters  and  save  the  results  and  figures  with  different  file  names 
without  supervision.  This  automatic  storing  of  output  files  and  figures  saves  a  lot  of  time 
and  enables  the  user  to  either  start  looking  at  the  results  while  the  code  is  running  or  leave 
the  code  running  and  examine  the  results  later. 


A.l  cluster,  m 

fimction  [clus_centr_new,A_inv,lisdist,asigned_class]=. . . 
Cluster (Data_ choice, codbksize,hard_fuzzy, . . . 
Mpara,Distnce , init_clus , weight) 


‘/,[clus_centr_new,A_inv,lisdist]  =  . . . 

Cluster (Data_ choice , codbksize ,hard_fuzzy ,Mpara,Distnce , init_clus .weight) 

y. 


'/o  Capt  Khaled  Younis,  RJAF 

y. 

y.  Aug  26,  1996 

y. 


y,  A  clustering  package  that  perform  Heird/Fuzzy  c-means 
y,  using  Euclidean/Weighted  Euclidean/Mahalanobis  distance 
y,  Initialization  by  Splitting/KLI/Min-max/Gridding  or  Randomly 

y.y.y.y.y.y.y.  Get  parameters 
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y,  Data_choice=  input  (’Data  set  ... 

(1)  XOR  data  (2)  Gaussians  (3)  IRIS  (4)  Daisy  (5)  Mandrill  . . . 
(6)  uniform  (7)  flir  (8)  Lenna  (9)  sub-manifold  (10)  Linear:  ’) 
y,  codbksize  =  input (’ Codebook  size  :  ’); 
y,  hard_fuzzy  =  input ( ’Clustering  Algorithm  ... 

(1)  LEG  (  HCM)  (2)  Fuzzy  C_means  :  ’); 
y,  init_clus  =  input (’ Codebook  Generation  ... 

(1)  Splitting  (2)  Fixed  size  (first  C  samples)  (3)  Maximin  . . . 
(4)  Bezdek  (5)  Desimio  (6)  get-then-split  (7)  random  :  ’); 
y,  Distnce  =  input (’ Similarity  Measure  ... 

(1)  Euclidean  Distance  (2)  Diagonal  covarieince .  . . 

(3)  changing  Mahalanobis  :  ’); 
y,  weight  =  input  (’Weighting  ... 

(0)  No  weight  (1)  Conscience  (2)  Determinemt  constraint  :  ’); 
y,  Mpara=0;  */,  to  set  it  zero  if  not  fuzzy 
y.  if  hard_fuzzy==2 

y  Mpara  =  input (’M  parameter  for  Fuzzy  C_means  :  ’); 
y,  end 

y,  end  y,if  nargin 

y, - End  get  parameters 

if  nargin==6 
weight=l ; 
end 

flops (0) 
tic 
rsq=2; 
close  all 
plt=l; 

V/tlXVIt  Initalize  parameters  for  the  whole  program  */,’/, 
lisdist=[];  ’/.To  record  the  distortion  vs.  number  of  iterations 
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epsl=0.001;  */, required  cheinge  rate  in  Distotion 
epsl2=le-6;  */, Required  min  change  in  cc  location  (fen  of  data) 
savplt=l;  ’/,(1)  saves  (0)  Not  to  save  the  plots 
savres=2;  ‘/.(O)  No  save  .. 

(1)  saves  in  ASCII  . . . 

(2)  save  in  variables  in  .mat 
M=l;  '/(For  splitting 
cls_tot=zeros (codbksize , 1) ; 

'/, - End  Initialize  parameters 


muu  Get  data 
data_in=getdata(Data_ choice) ; 

[row , col] =size (data_in) ; 

av=mean(data_in) ; 

norm_data=data_in-ones (row, 1) *av ; 

max_length=max (diag(norm_data*norm_data’ ) ) ; 

data_in==norm_data . / (ones (row , col) ^sqrt (max_length) ) ; 

asignd_class=ones (row, 1) ; 

data_cl=[asignd_class  data_in] ; 

covr=cov(data_in) ; 

covr_all=covr ; 

mx_data=max(data_in) ; 

mn_data=min(data_in) ; 

mx=max (mx_data) +1 ; 

mn=min(mn_data) -1 ; 

y, - end  get  data 


VIXIXVIt  Suitable  matrix  for  ellipse  drawing  y,y,y,D:disnce 
if  Distnce==l, 

elps_cov=reshape (eye (size (covr))  ,l,col‘'2)  ; 
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elseif  Distnce==2, 


elps_cov=zeros(size(covr)) ; 

elps_cov=reshape(diag(diag(covr) ,0) ,l,col~2) ; 
elseif  Distnce==3, 

elps_cov=reshape(inv(covr)  ,l,col'2) ;  */,  initially  Global  covarieince  matrix 

*/,elps_cov=reshape(eye(col)  ,l,col~2)  ;  */,  initially  Euclidean  distance 

else  error (’invalid  choice’) 
end  '/,  if  dist 

A_inv=elps_cov;  ‘/.redundant  for  other  than  WMD 
%___ - 


*/,  For  automatic  savings  of  the  figures  and  the  results 

Which=[Data_ choice  codbksize  hard_fuzzy  Mpara  Distnce  init_clus  weight] 

imV/,  Get  initial  cluster  centers  mmUU 

if  init_clus®=l  */, splitting 

A0=av; 

clus_centr_old=AO;  '/.initial  codebook  vector  (centroid) 

elseif  init_clus==2  */.  C  samples 

clus_centr_new=data_in(l : codbksize,  :)+10~ (-6)  ;  '/.  first  c  samples 

elseif  init_clus==3  '/.Minimax 

*/.  eta=.2;  ’/.  min  percetage  of  the  disteince  to  last  cc 

*/.  clus_centr_new=min_max  (data_in,  codbksize ,  eta)  ; 

clus_centr_new=turbo_mm(data_in, codbksize) +eps ; 
elseif  init_clus==4  ’/.  Bezdek’s  Gridding 
dif_over_N=l/(codbksize-l)* (mx_data-mn_data) ; 
for  i=l : codbksize 

clus_centr_new(i, : )=mn_data+(i-l)*dif _over_N; 
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end  7, for  i 

clus_centr_new=clus_centr_new+10" (-6) ; 

elseif  init_clus==5  7.KLI 
c lus  _  c  ent r _ne w=kl i ( dat  a_ in , c  odbks ize ) ; 

elseif  init_clus==6 
load  ./autosaves/mandinah4splt2 
giv_clus_centr  =  clus_centr_new; 
load  ./autosavespert 
giv_pert  =  pert; 

M=length(giv_pert (:,!)); 
elseif  init_clus==7  7,  random 
clus_centr_new=randn(codbksize ,col) ; 
else  error (’ invalid  choice’) 
end  7.if  init_clus 

I - 


while  M<codbksize 

7.###  Start  splitting  loop  #################### 
if  init_clus==l  7oSplitting 
if  M==l 

[vec,val]=eig(covr_all) ; 
pert=(vec*sqrt(diag(val))) ’ ; 
end 

pertlbg=(randn(l , col)- . 5) *le-l*mean(mean(data_in) ) ; 
7o7o7.7o7.  Get  cc  &  pert  after  splitting  7.7o7.7.7.7,7t7o7.7o7Q7.7.7o7. 

if  2*M<=codbksize  '/,  Splitting  by  two 
if  Distnce  ==3 
clus_centr_new=. . . 
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[clus_centr_old+pert/M/2; clus_centr_old-pert/M/2] ; 
else 

clus_centr_new=. . . 

[clus_centr_old+ones(M,l)>fpertlbg; . . . 
clus_centr_old-ones(M, l)*pertlbg] ; 
end 

A_inv=[A_inv;A_inv] ; 

M=2*M; 

elseif  2*M>codbksize  '/,  add  required  number 
num_sp=codbksize-M; 

[Srt , IND] =sort (cls_tot) ; 
if  Distnce==3 

clus_centr_new(IND(M:-l;M-num_sp+l), :)=. . . 
clus_centr_old(IND(M:-l :M-num_sp+l) 
pert (IND (M : -1 ;M-num_sp+l) , :)/M/2; 
clus_centr_new=[clus_centr_new; . . . 
clus_centr_old(IND(M:-l :M-num_sp+l) , . . 
pert (IND (M:-l:M-num_sp+l) , :)/M/23 ; 
else 

clus_centr_new(IND(M:-l:M-ntun_sp+l) , :)=. . . 
clus_centr_old(IND(M:-l:M-num_sp+l) , :)+. . . 
one  s (num_  sp,l)*pertlbg; 
clus_centr_new=[clus_centr_new; . . . 
clus_centr_old(IND(M:-l:M-num_sp+l) 

-ones(num_sp, l)*pertlbg] ; 
end 

A_inv= [A_inv; A_inv(IND(M:-l :M-num_sp+l) , ; 

M=M+num_sp; 

end  */,  if  2*M 

% - End  Get  cc  &  pert 
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elseif  init_clus==6 
pert=giv_pert ; 

clus_centr_old=giv_clus_centr ; 
else 

M=codbksize; 

A_inv=ones (codbksize , 1) *reshape (A_inv , 1 , col~2) ; 
end  */,if  init_clus 
*/, - End  splitting  loop 

'ItVItltlt  find  suitable  A  mtx  for  distance  calc  */,*/,*/. 
if  Distnce==l  */,Euclidecin 
A_matrix=ones(col,M) ; 

elseif  Distnce==2  ‘/.Diagonal  cov  (Ecld  divided  by  std) 
A_matrix=diag(covr)*ones(l ,M) ; 
end  */.  if  Dis 

I - 


initialization  for  distortion  calc  loop 
m=0;  */, counter  for  iterations  each  sp-stag 
oldmxmov=0 ;  var=0 ;  '/.to  control  the  number  of  iterations 
Dist_old=realmax;  '/.initialize  distortion  each  sp-stage 
Dist_new=99~49 ; 

inx_movcc=100;  '/.To  check  stability  of  moving  cc  in  WMD 
Distance=zeros(row,M) ;  '/.  Distance  matrix 
list=[]  ; 

consc=ones(l,M) ; 

'/,#########  Start  Distortion  Calculations  loop  ############# 
while  ((Dist_old-Dist_new)/Dist_new>epsl)  &  mx_movcc>epsl2 
clus_centr_old=clus_centr_new; 

U=zeros(row,M)  ;  '/.  Membership  matrix  update  each  iter 
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i=0; 

VI,VI>  find  new  distance  matricies  then  U  mtx  ’/,*/, */«7o’/o’/.’/. 
while  i<row 
i=i+l ; 

if  Distnce  "=3 

Distance(i, :)=suni(((clus_centr_new’-(data_in(i, 

*ones(l,M)) . ~2) ./A_matrix) ; 

else 

for  k=l:M 

Distance (i,k)=[clus_centr_new(k, : )-data_in(i , 
reshape (A_inv(k, :) , col, col)*. . , 

[clus_centr_new(k, :)-data_in(i, ; 
Distance_wocon(i,k)=Distance(i,k)/consc(k) ; 
end  '/.for  k 
end  '/.if  Dis 

if  hard_fuzzy==l  '/.  HCM  (  LEG  ) 

Mpara=3 ; 

[nn,ii]=min(Distance(i, :)) ; 

U(i,ii)=l; 

elseif  hard_fuzzy==2  '/.  Fuzzy  C_means 
for  j=l:M 

U(i, j)=l/sum((ones(M,l)*sqrt(Distance(i, j)) ./. . . 
sqrt (Distance (i (2/ (Mpara-1) ) ) ; 
end  '/.  for  j 

else  error(’ invalid  choice’) 
end  '/.  end  if  hard 
end  '/.  end  for  i 

'/. - end  Distance  &  U  calc 
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Update  the  covariance  mtx,  pert,  clus-centr  VlXVItVIXVh 
[duirany,asignd_class]=max(U’) ; 
clear  dummy 

clus_centr_new=((l./sum(U.~Mpara)) ’*ones(l,col) ) . 

( (U. “Mpara) ’*data_in) ; 
for  p=l :M 

indx=f ind(asignd_class==p) ; 
len_class=length(indx) ; 
cls_tot (p)=len_class ; 
if  len_class>l 
covr=cov(data_in(indx, :))  ; 

if  len_class<col,  covr=covr+eps*eye(col) ;  ’eps’,  end 
inv_covr=inv(covr) ; 
if  Distnce==3 
if  weight==0 

consc(p)=l;  */,  no  conscience 
elseif  weight==l 

consc(p)=len_class/row;  */,  conscience 
elseif  weight==2 

consc(p)=l/(det(inv_covr))“(l/col) ;  7,  volume  constraint 
end 

A_inv(p, :)=consc(p)*reshape(inv_covr,l,col'2) ; 
det(reshape(A_inv(p, :) , col, col)) ; 
end 
else 

covr=covr_all; 
end  7, if  len 

[vec,val]=eig(covr) ; 

[mval , inval] =max(diag(val) ) ; 
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pert(p,  :)=sqrt(mval)*vec(:  jinval)  ’ ;  7,  mx  vrnce  dirctn 
7,  pert(p,  :)  =  (vec*sqrt(diag(val)))  ’ ;  7«  sm  of  scaled  e-vec 

end  7t  for  p 
if  hard_fuzzy==l 
ind=f ind(sum(U)==l  |  sum(U)==0); 
len=length(ind) ; 
if  len>0 

[inxl,indl]=max(sum(U))  ; 

clus_centr_new(ind, :  )  =  (reindn(len,  l)+l)*clus_centr_new(indl , :)  ; 
end  7oif  len 


end  7.  end  if  hard 

7, - End  update 


7.7o7o7.  Check  stopping  criteria.  Distortion  or  mxmov2 
if  Distnce==3  |  hard_fuzzy==2 

inx_movcc=max(sum(clus_centr_new’-clus_centr_old’) . “2) 
JMD(m+l)=sum(sum((U.  “Mpara)  .*Disteince)) ; 
if  m<3,  inx_movcc=mx_movcc+le-6;end 
else 

Dist_old=Dist_new; 

Dist_new=sum(suin((U.~Mpara) .*Distance))  ; 
end  7.if  Dis 

7 - End  check 

7o7o  Distanced  :row,asignd_class’) 

7.  JMD  (m+1)  =s™(Distance  ( 1 :  row ,  asignd_class  ’ )  ) 
m=m+l ; 

lisdist=[lisdist;M  m  Dist_new  inx_movcc]  ; 
var=var+sign(inx_movcc-oldinxmov)  ; 

list=[list ; var  reshape (clus_centr_new,l,M*col)  reshape (pert , 1 ,M*col)] ; 
oldmxmov=inx_movcc ; 
if  m==40 
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[aa,bb]=min(list(: ,1)) ; 

clus_ceiitr_new=reshape (list (bb , 2 : M*col+l ) , M , col ) ; 
pert=reshape(list(bb,M*col+2:2*M*col+l) ,M,col) ; 
inx_movcc=0; 
end  '/,  if  m 


7o‘/.’/.y.yc7.'/.'/.’/.‘/.*/o7.  To  display  results  in  2D  case 
if  plt==l 

if  col>2,data2d=eigtrans(data_in,2) ; 
else  data2d=data_in; 
end 

if  M<=6 

plotclusters(data2d,M,asignd_class,A_inv,col,clus_centr_new, . . 

Distnce,elps_cov,rsq,elps_cov, Which, savplt) 
elseif  M>6  &  col==2 

plotclusters2(data2d,M,asignd_class,A_inv,col,clus_centr_new, . 

Distnce,elps_cov,rsq,inn,mx,elps_cov, Which, savplt) 
end  y,if  M 

end  7, if  col 

7, - end  display  results 

end  7,  end  while  D 

7, - End  distortion  calc  loop 

if  init_clus"=l , break, ’p’ , end  7,  quit  in  case  of  no  splitting 
end  7,  end  while  M  (splitting) 

7, - End  main  loop 

yotot_time=etime  ( clock,  tm) 
tot_time=toc 
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itern=length(lisdist ( : , 1) ) 
megaflops=flops/toc/l .e6 


if  savres==l 

eval([’save  ./autosaves/’  int2str (Which)  ’res  clus_centr_new  A_inv 
lisdist  tot_time  asigned_class  megaflops  -ascii’]) 
elseif  savres==2 

eval([’save  ./autosaves/’  int2str(Which)  ’res  clus_centr_new  A_inv 
lisdist  tot_time  asigned_class  megaflops’]) 
end 

*/,  leave  space  when  you  use  save  commeuid  (save  ’) 

*/X/«*/o'/o  Iris  data  display 
if  Data_choice==3 
[a,b]=max(U’) ; 
b 

end 

7, - End  Iris 

7o  To  measure  the  mutual  MD  between  clusters 
7.  dist=zeros(codbksize,codbksize)  ; 
for  i=l : codbksize 
for  j=l : codbksize 

if  i"=j 

dist(i, j)=mah_dist(clus_centr_new(i, :) ,clus_centr_new(j ,:),... 
reshape (A_inv(i, :) , col, col)) ; 
end 

end 

end 
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’/o'/o’/o  To  run  a  movie  of  clustering  imtil  ctrl-c  is  pressed 
if  col==2  &  plt==l 
y.  while  1>0 

for  k=l:itern; 
figure (k) 
pause (1) 
end 

7o  end 

end 

I - 
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A.  1.1  getdata.m.  Any  data  set  to  be  partitioned  is  stored  in  a  specific  subdirec¬ 
tory  called  datasets  and  included  in  the  choices  of  the  main  program. 

fimction  data_in=getdata(Data_ choice) 
if  Data_choice==l 

load  /home/hawkeye6/96d/96d/kyoimis/dataset/xorData. asc 
data_in=xorData( : ,2:3) ; 
elseif  Data_choice==2 

load  /home/hawkeye6/96d/96d/kyoimis/dataset/gaussians3 
dat  a_ in=gaus  s i ans  3 ; 
elseif  Data_choice==3 

load  /home/hawkeye6/96d/96d/kyounis/dataset/irisall 
data_in=irisall ( : , 2 : 5) ; 
elseif  Data_choice==4 

load  /home/hawkeye6/96d/96d/kyounis/dataset/daisy .train 

y,  This  ignores  the  first  colm  which  is  supposed  to  give  class 

jjj=length(daisy(l, :)) ; 

data_in=daisy( : ,2: j j j) ; 

elseif  Data_choice==5 

load  /home/hawkeye6/96d/96d/kyounis/dataset/ intens_vec 
data_in=intens_vec ; 
elseif  Data_choice==6 

load  /home/hawkeye6/96d/96d/kyo\mis/dataset/unif 
data_in=unif (1 : 1000, : ) ; 
elseif  Data_choice==7 

load  /home/hawkeye6/96d/96d/kyounis/dataset/f lir . dat 
data_in-flir( : ,2:18) ; 
elseif  Data_choice==8 

load  /home/hawkeye6/96d/96d/kyounis/dataset/8dec44 
data_in=intens_vec ; 
elseif  Data_choice==9 
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load  /home/hawkeye6/96d/96d/kyounis/dataset/s_shaped 

data_in=s_shaped; 

elseif  Data_choice==10 

load  /home/hawkeye6/96d/96d/kyounis/dataset/linear2 

data_in=lin_data; 

else  error (’invalid  choice’) 

end 
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A.  1.2  KLI.m.  This  KLI  initialization  code  implemented  here  was  written  by  Dr. 


Martin  DeSimio. 

’/,  kli.m 

7, 

7,  Dr.  Martin  deSimio 
7o  22  august  1995 

7. 

7«  use  this  routine  to  get  an  initial  codebook  of  any  size 
7o  function  klicbk  =  kli2(y  ,Ncdw) ; 

7.  where 

7.  y  =  the  set  of  data,  one  fv  per  row 
7.  Ncdw  =  the  number  of  codewords  to  generate 

function  klicbk  =  kli2(y ,Ncdw) ; 

7t  make  data  set  zero  mecin 

7o  first  find  number  of  fv’s  (#  rows) 

ndrows  =  size(y,l); 

meany  =  mean(y); 

allmeany  =  one s (ndrows, l)*meany; 
y  =  y  -  allmeany; 

7o  compute  covariance  of  zero  mean  data,  then  find  eigenvectors  of 
7»  covariance  matrix 
covy  =  cov(y); 

[v,d]  =  eig(covy); 

7.  find  the  total  power 
P  =  trace (d) ; 
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*/,  get  the  number  of  dimensions 
dims  =  sizeCcovy, 1) ; 

y,  N(i)  is  the  number  of  initial  codewords  along  dimension  i 

N  =  zeros(dims,l) ; 

for  dimindx  =  l:dims  */,  for  each  dimension 

N(dimindx)  =  round(  Ncdw  *  d (dimindx, dimindx )/P  ); 

end 

ncheck  =  STim(N)  ; 
if (ncheck  "=  Ncdw) 

warning  =  ’not  right  number  of  codewords!’ 
delta_ncdw  =  ncheck  -  Ncdw 

if  (delta_ncdw  >  0)  '/,  find  dimension  with  most  codewords  and  subtract 
[a,b]  =  max(N) 

N(b)  =  N(b)  -  delta_ncdw; 

end 

if  (delta_ncdw  <  0)  '/,  find  dimenstion  with  least  codewords  and  add 
[a,b]  =  min(N) ; 

N(b)  =  N(b)  -  delta_ncdw;  7,  note  that  this  will  increase  N(b) 

end 

end 

N; 

newcheck=sum(N) ; 


y,  now  spread  codewords  out  along  eigenvectors; 

y,  assume  spread  of  +/-  2sigma;  this  keeps  codwords  within  the  data 
row  =  1; 
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for  i  =  l:dims 
if  (N(i)  "=0  ) 
sigm  =  sqrt(d(i,i)) ; 

if  (rem(  N(i),2))  ==  0  */,  then  even  number  of  codewords 

oddity  =  ’no’;  */,  in  this  dimension 
cell  =  4*sigm/N(i); 
for  j  =  l:N(i)/2 

icbkCrow,:)  =  v(:,i)’*  (l/2)*(2*j  -l)*cell; 
icbk(row+l , : )  =  -l*icbk(row, : ) ; 
row  =  row+2; 

end 

end 

if  (rem(  N(i),2))  ==  1  then  odd  number  of  codewords 
oddity  =  ’yes’;  */,  in  this  dimension 
if  (N(i)  ==  1) 

icbk(row,:)  =  sigm*0.001*  randn(l,dims) ; 
row  =  row+1;  */,  added  this  fix  on  26  July 
else 

icbk(row,:)  =  sigm*0.001*randn(l,dims) ; 
row  =  row+1; 

cell  =  4*sigm/(N(i)-l) ; 
for  j  =  1:  (N(i)-l)/2 

icbk(row,:)  =  v( : ,i) ’*j*cell; 
icbk(row+l , : )  =  -l*icbk(row, :) ; 
row  =  row+2  ; 

end 

end 

end 

end  */,  for  N(i)  not  =  0 
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end 


y,  now  add  the  meein  value  to  the  codebook  vectors 


addmn  =  ones(Ncdw,l)*meany; 
klicbk  =  icbk  +  addmn; 
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A.  1.3  plotclusters.m. 


function  plotclusters(data_in,M,asignd_class,A_inv,col,clus_centr_new, . . . 

Distnce,elps_cov,rsq,elps_cov, Which, savplt) 


mx_data=max(data_in) ; 
mn_data=min(data_in) ; 
nix=max(mx_data)+l ; 
nin=inin(inn_data)  -1 ; 

figure ( ’Position’ , [630,  700,  400  ,400]) 
hold  on 

lll=f ind(asignd_class==l) ; 

plot (data_in(lll , 1) ,data_in(lll ,2) , ’o’ ) 

112=f ind(asignd_class==2) ;  plot(data_in(112,l) ,data_in(112,2) , ’x’) 
set(gca, ’FontSize’ ,8) 

if  M==2,  legendC’Class  1’, ’Class  2’); 
elseif  M>2,113=find(asignd_class==3) ; 
plot(data_in(113,l) ,data_in(113,2) , ’+’) 

if  M==3,  legend (’CLASS  1’,’ CLASS  2’, ’CLASS  3’) 
elseif  M>3,114=f ind(asignd_class==4) ; 
plot(data_in(114,l) ,data_in(114,2) , ’ . ’) 
if  M==4, legend (’Class  1’,’ Class  2’, ’Class  3’, ’Class  4’); 
elseif  M>4,115=f ind(asignd_class==5) ; 

plot(data_in(115,l) ,data_in(115,2) , ’ro’) 

if  M==5,  legend (’CLASS  1’,’ CLASS  2’, ’CLASS  3’, ’CLASS  4’, ’CLASS  5’) 
elseif  M>5,116=find(asignd_class==6) ; 
plot(data_in(116,l) ,data_in(116,2) , ’r+’) 

legend( ’CLASS  1’, ’CLASS  2’, ’CLASS  3’, ’CLASS  4’, ’CLASS  5’, ’CLASS  6’) 
end 
end 
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end 


end 
y.  grid 

axis([mn  mx  mn  mx] ) 
axis (’square’) 
setCgca, ’FontSize’ ,15) 
if  col==2 
for  j=l:M 

elps_cov=reshape(A_inv(j , :) , col, col) ; 
ellipse (clus_centr_new(j , :) ,elps_cov,rsq) 
xlabeK ’First  Feature’) 
ylabeK ’Second  Feature’) 
end  y.for  j 
else 

xlabeK ’First  Principal  Component’) 
ylabeK ’Second  Principal  Component’) 
end  y,if  col 

setCgca, ’FontSize’ ,12) ; 
hold  off 


if  savplt==l 

plt_name=[’data’  int2str (Which) ]  ; 

evaK [’print  -deps  /home/hawkeye6/96d/96d/kyounis/PR/images/’  plt_name]) 
end  y,if  savplt 
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A. 2  Hyperellipsoidal  Clustering  (HEC)  algorithm 

(R) 

Following  are  the  MATLAB^  codes  used  to  implement  the  HEC  algorithm.  The 
code  HECVQ.m  implements  the  batch  vector  quantization  algorithm  described  in  HEC. 
Finally,  the  code  PC  A.m  implements  the  neural  network  for  principal  component  analysis. 

A.  2.1  HEC.m. 

•/.  HEC; 

function  class=HEC(Data_choice,codbksize) ; 
y, clear 
close  all 
tic 

flops (0) 
con_f lag=0 ; 
plt_flag=l; 

VhVL'Itl'/tVI*  load  input  data  and  substract  the  mean  then  normalize 
y,Data_choice=  input  (’Data  set  (1)  XOR  data  (2)  Gaussians  (3)  IRIS  (4)  Daisy 

(5)  Mandrill  (6)  uniform  (7)  flir  (8)  Lenna  . . 
(9)  sub-manifold  (10)  linear  :  ’); 

'/.codbksize  =  input (’ Codebook  size  :  ’); 
data_in=getdata(Data_ choice) ; 

[row,col]=size(data_in) ; 

mn=mean(data_in) ; 

data_in=data_in-ones (row, 1) *mn; 

max_length=max (diag (data_in*data_in ’ ) ) ; 

data_in=data_in . / (ones (row , col) *sqrt (max_length) ) ; 

mx_data=max(data_in) ; 

mn_data=min(data_in) ; 

mx=max(mx_data)+l ; 

mn=min(mn_data) -1 ; 

liml= [mn  mx  mn  mx] ; 
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I - 

UmU  initialize 

W=ones(codbksize,l)*reshape(eye(col) ,l,col'2) ; 

7oeval=ones (codbksize ,  col)  ; 

'/.deal  with  column  vectors  in  cc 
cc_old=0 . l*randn(col , codbksize) ; 

•/.cc_old=.l*[l  2;2  3  ;1  2;3  1]; 

V=cc_old; 

Y=data_in; 
lambda=0 ; 

Delt_lambda= . 2 ; 
lambdamin=0 ; 
epsl=le-6; 

S=ones( codbksize, col)  ; 

cons=ones(l, codbksize) ; 

thresh=le-4; 

diff=l ; 

m=0; 

oldclass=zeros(l ,row) ; 

inv_cov_mtx=ones (codbksize , 1) *reshape (eye (col) , 1 , col~2) ; 

'/, - End  initialization 

UlUmmU  start  main  loop 
while  diff>thresh 

[cc_new, class, B,mvq]=HECVq(data_in,Y,cc_old,V,W,S, lambda, m, liml ,con_f lag, . 

cons , inv_cov_mtx) ; 

if  Data_choice==3, class 
end  7, if  data 

if  col==2  &  plt_flag==l 

h3=figure( ’Position’ , [1 ,  600,  400  ,400]); 
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h4=figure( ’Position’ , [1 ,  100,  400  ,400]); 
end  */,if  col 


'/.*/.7o7o7o'/«y.7o7«  compute  new  mean,  cov  according  to  the  new  partitioning 

for  i=l : codbksize 

[val2 , ind2] =f ind(class==i) ; 

len=length(ind2) ; 

if  con_flag==l, cons (i)=len/row; end 
if  len>l 

7.7.7.7o7o7«7«7o7«  Get  evec,eval,mean  and  covr 
[trans,M,covr,vec,val]=eigtrans(data_in(ind2, :) ,2,0,0) ; 

I - 

cc_new( : ,i)=M( ; ) ; 
inv_cov=inv(covr) ; 

inv_cov_mtx(i,  :)=reshape(inv_cov,l,col'‘2) ; 

W(i, :) “reshape (vec,l,col''2)  ; 
eval(i, : )=val; 

S(i,  :)=sqrt(val+epsl)  .'"(-I)  ; 

V( : ,i)=S(i, : ) ’ .*(vec’*cc_new( : ,i)) ; 

Y(ind2, :)=(ones(len,l)*S(i, :)) .*(data_in(ind2, :)*vec) ; 
else 
’error’ 
if  len==l 

cc_new( : ,i)=data_in(ind2, :) ’ ; 
else 

cc_new( : ,i)=.01*randn(col,l) ; 
end  7.if  len=l 
end  7«if  length>l 

7.7o7o7o7o7.7o7.  plot  partitioning 
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if  col==2  &  plt_flag==l 
figure (h3) 

title([’m=  int2str(m)]) 

if  i==l,  plot(data_in(ind2,l) ,data_in(ind2,2) , ’ . ’ ) 
elseif  i==2,plot(data_in(ind2,l) ,data_in(ind2,2) , ’x’) 
elseif  i==3,plot(data_in(ind2,l) ,data_in(ind2,2) , 
elseif  i==4,plot(data_in(ind2,l) ,data_in(ind2,2) , ’o’ ) 
end  */,  if  i==l 
hold  on 

plot(cc_new(l,i) ,cc_new(2,i) , ’*’) 
ellipse (M,inv_cov, 1) 
end  '/oif  col 


*/, - End  plotting 

end  ‘/.for  i 

*/, - End  computation 


if  col==2  &  plt_flag==l 
figure (h3) , hold  off 
’/,  figure(h4)  .hold  off 
end  */,  if  col 

if  class==oldclass , break, else ,oldclass=class ; 
end  '/,if  class 

lambda=max ( lambdamin , lambda-Delt_lambda) ; 
m=m+mvq 

’/,  if  m== 10, break, end 
cc_old=cc_new; 
end  */,  while  diff 
y. - End  main  loop 


tt=toc 

megaf lops=f lops/tt/1 . e6 
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TlXVIXVIt  Display  results  after  rearranging 
nuin_fig=gcf ; 

‘/.while  1>0 

‘/.for  i=l:nm_fig 

‘/.figure (i)  .paused) 

‘/.end 

‘/.end 
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A.2.2  HECVQ.m. 


function  [cc_new, class, B,mvq]=HECVQ(nonn_ data, y,cc_old,V,evec,S, lambda, . . . 

mm, liml , con_f lag, cons , inv_cov_mtx) 

[row, col] =size(norm_ data) ; 

'/.deal  with  column  vectors  in  cc  and  V 

codbksize=length(cc_old(l, :)) ; 

yX/’/XIX/,  Init ializat ion 

B=0; 

mvq=0 ; 

diff=l; 

thresh=le-4; 

plt_flag=0; 

- 


mxixiximv,  start  the  vq  loop 

while  diff>=thresh 

cc_old; 

VI'IXI'l'ItVLVI,  Find  new  partitions  (class)  and  projection  of  data  points 
for  i=l:row 
test=norm_data(i , :) ; 

’/.find  distortions 
for  j=l : codbksize 

yl(j , : )=  S(j , : ) .*(test*reshape(evec(j , :) , col, col)) ; 
D(j)=cons(j)*(lambda*euclid(test’ ,cc_old(: ,j))+. . . 

(1-lambda) *euclid(yl ( j ,:)’,V(:,j))); 

end  ’/.for  j 

[val , ind] =min(D) ; 
class(i)=ind; 
y(i, ;)=yl(ind, :) ; 
end  ’/.for  i 
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7. 


ThVlt’IXVhVIX  compute  the  new  means  k  their  projection 
list=[]  ; 

for  i=l:codbksize 
[val2,ind2]=f ind(class==i) ; 
len(i)=length(ind2) ; 

if  con_flag==l,  cons(i)=max(len(i)/row, l/codbksize) ; 
end  7,  if  con 

if  len(i)>l 

cc_new(: ,i)=mean(norm_data(ind2, ; 

else 

’one/no  sample ’,i 
list=[list  i] ; 
end  7.if  len 
end  7.for  i 

bad_cc=length(list) ; 

[duml ,dum2] =sort (len) ; 
big_clus=fliplr(dum2) ; 

for  i=l:bad_cc 

if  mm==0,  pert=le-2*randn(l,col) 
else 

pert=S(big_clus(i) ‘ (-2)* (reshape (evec(i ,:) ,col,col) )  ' 
evec(list(i) , :)=evec(big_clus(i) , ; 
inv_cov_mtx(list(i) , : )=inv_cov_mtx(big_clus(i) , ; 
end  7.if  mm 

cc_new( : ,list(i))=cc_new( : ,big_clus(i))-pert’ ; 
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cc_new(: ,big_clus(i) )=cc_new( : ,big_clus(i))+pert’ ; 
end  */,  for  i 

7,  compute  the  centroids  in  the  whitened  space 
for  i=l : codbksize 

V( : , i)=S(i , : ) ’ (reshape (eve c (i, : ) , col, col) ’ *cc_new( : ,i)) ; 
end 

5^ - - 


y.7.y.7.y.y.7.7.y.y.y.7.y.  piot  partitions 

if  col==2  &  plt_flag==l 
hl=figure ( ’Position’ , [500,  1,  700  ,700]); 
y,  h2=figure( ’Position’ ,  [820,  1,  300  ,700]); 
for  i=l : codbksize 
[val2,ind2]=find(class==i) ; 
figure (hi) 

plot (cc_new(l , i) ,cc_new(2,i) , ’*’) 
hold  on 

if  i==l,plot(norm_data(ind2,l) ,norm_data(ind2,2) , ’ . ’) 
elseif  i==2,plot(norm_data(ind2,l) ,norm_data(ind2,2) , ’x’) 
elseif  i==3,plot(norm_data(ind2,l) ,norm_data(ind2,2) , ’+’) 
elseif  i==4,plot(norm_data(ind2,l) ,norm_data(ind2,2) , ’o’) 
end 

ellipse (cc_new( : ,i) , reshape (inv_cov_mtx(i , : ) , col , col) ,1) 

axis(liml) 

end  y,  for  i 

figure(hl),  hold  off 

end  y,if  col 

'/o - End  Plotting 
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mvq=mvq+l 

if  mvq==10; break, 
end 

diff=max(diag(euclid(cc_old,cc_new))) ; 

cc_old=cc_new; 

end  ’/.while  diff 

’/. - End  VQ  loop 

A.2.3  PCA.m. 

clear 


load  input  data  and  substract  the  meein 
Data_choice=  input (’Data  set  (1)  XOR  data  (2)  Gaussians  (3)  IRIS  (4)  Daisy 

(5)  Mandrill  (6)  uniform  (7)  flir  (8)  Lenna  , . 
(9)  sub-manifold  :  ’); 


data_in=getdata(Data_choice) ; 

[row , col] =size (data_in) ; 
inn=mean(data_in)  ; 
norm_data=data_in-ones (row, 1) *mn; 


[vec,val]=eig(cov(norm_data)) ; 
[lambda,k]=sort(diag(val)) ; 
k=k(col : -1:1,1) 
vec=vec( : ,k) ; 

vec=vec . / (ones (col , 1) *max(abs (vec) )  ) 
lambda=lambda(col: -1:1,1)  ’ 


initialize 


thresh=le-2; 
W=rand(col,col) ; 
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i=0; 

H=zeros(l,col) ; 

U=triu(rand(col,col) ,1) ; 
eta=4e-2; 

Mu=9e-2 ; 
epoch=0; 

while  max(max(abs(U)))>thresh 
i=i+l ; 

pattern=norm_data(i, : ) ’ ; 

if  i==row , i=0 ; epoch=epoch+l , W . / (ones (col , 1) *max (abs (W) ) ) , end 
for  j=l:col; 

H ( j ) =W ( : , j ) ’ *pattern+H*U ( : , j ) ; 
end  */,  for  j 
Delt_W=eta*pattern*H ; 

Delt_U=-Mu*triu(H’*H,l) ; 

W=W+Delt_W; 

U=U+Delt_U; 

W=W./(ones(col,col)*max(max(abs(W)))) ;  ’/,  normalize  W 
if  epoch==100, break, end 
end  */(  while  max 
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Appendix  B.  Vector  Quantization  Code 


The  following  MATLAB^  code  is  used  to  perform  vector  quantization  coding  of 
images. 

B.l  ImageVQ.m 
y,  Vector  quantization 

y,f imction  [Xnew ,  kh]  =ImageVQ  (Data_ choice ,  codbksize  ,hard_f uzzy , Mpara , Distnce ,  init_clus) 


tm=clock; 

M=4;  y,  codebook  size 

R=log(M)/log(2)  ;  y,no.  of  bits  to  represent  codebook  (  M=  2~R  ) 
y,  if  nargin<l 

Mpara=0;  */,  to  set  it  zero  if  not  fuzzy 

y.y.y.y.y.y.y.  Get  parameters  y;/.yy.y.y.y.y.y.y.y.y.y.y. 

Data_choice=  input(’  Data  set  (1)  XOR  data  (2)  Gaussians  (3)  IRIS  (4)Daisy... 

(5)  Mandrill  (6)  uniform  (7)  flir  (8)  Lenna  . . . 
(9)  sub-manifold 

codbksize  =  input (’ Codebook  size  :  ’); 

hard.fuzzy  =  input (’ Clustering  Algorithm  (1)  LEG  (  HCM)  (2)  Fuzzy  C_means  :  '); 
Distnce  =  input ( ’Similarity  Measure  (1)  Euclidean  Distance  ... 

(2)  Diagonal  covariance  . . . 

(3)  changing  Mahalanobis  :  ’); 

init_clus  =  input (’ Codebook  Generation  (1)  Splitting  (2)  first  C  samples  ... 

(3)  Minimax  (4)  Bezdek  (5)  Desimio  (6)  get-then-split  :  ’); 

if  hard_fuzzy==2 

mm  =  input (’M  parameter  for  Fuzzy  C.meeins  :  ’); 
end 

y.  endy,if  nargin 


Which- [Data_ choice  codbksize  hard_fuzzy  Mpara  Distnce  init_clus] 
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'amxu  load  the  test  image  y.7.m7x/xmy.‘/.*/.7.y.7.y,y.y.7.7.y.7. 

clear  dis_test 
if  Data_choice==5 
load  ./autosaves/mandrill 
[rom, com] =size (X) ; 

ival  =  mean (map’ )’ ;  7.  the  average  of  the  3  values  of  colormap  (B&W  intensity) 
newmap= [  ival  ival  ival] ; 
for  i=l:rom 
for  j=l:com 

X_intens(i, j)=ival(X(i, j))  ; 

7,  Matrix  of  480x500  of  intensity  rather  than  indecies  as  in  X 
end 


end 

elseif  Data_choice==8 

load  ./autosaves/lenna 

X_intens=lenna ; 

else  error (’ invalid  choice’) 

end 

[ro , co] =size (X_intens) ; 

I - 


7.7.7.y.y.y.7.y.y.y.7.y.7.  Decimate  the  image  7x/.7.7x/.y.7.y.y.y.7.7.7.7.7.y.7.7.7.7.y. 

7,  Get  upper-left  pixel  from  4x4  block,  and  cind  arrange  each  4x4  of  them  in  a  row 
7.  [intens_vec,x_intens]=decimate(X_intens,4,4) ; 

7o  Or  get  saved  decimated  image  directly 

eval([’load  /home/hawkeye6/96d/96d/kyounis/dataset/’  int2str (Data_choice)  ’dec44’]) 


7.- 


91 


vavmvh  Find  codebook  7.y.y///my.y//.yx/.‘/x/.7.y.*/.y.7.*/.7. 

y. [A2 , lis2]  =Cluster (intens_vec  ,M)  ; 

y,  Or  you  can  load  predetermined  codebook  (and  A_inv  matrix  if  needed) 
y,  to  load  the  codebook  of  the  same  image 

load  /home/hawkeye6/96d/96d/kyo\inis/PR/clustering/autosaves/8410352res 
y,  to  load  a  specific  codebook 

y,load  /home/hawkeye6/96d/96d/kyounis/PR/clustering/autosaves/541035res 


VhV/'IXVI'l,  Perform  VQ  to  reconstruct  the  image  VI'IX'I'I'I'I'IX 
A2=clus_centr_new ; 
for  i=l:4:ro 
for  j=l:4:co 

x_test=reshape(X_intens(i:i+3, j : j+3) ,1,16) ; 
dif f =ones (M, 1) *x_test-A2 ; 
dis_test=sum(dif f .  '"2  ’ ) ; 

[R,ind]=min(dis_test) ; 

Xnew(i : i+3, j : j+3)=reshape(A2(ind, : ) ,4,4) ; 


eval([’save  ./autosaves/’  int2str (Which)  ’  Xnew’]) 


y.y.y.y.y.y.7.y,y.y.y.7.yy.  compute  distortion  /  view  image  7.7.y.y.y.7.7.y,y.y.y.y.7.7.y.y.y. 

dist=mean(mean(  (X_intens-Xnew)  .  ”'2) ) 
mmm=max (max (Xnew) ) ; 

Xnew3=Xnew*65/mmm;  '/.scale  such  that  max  grey  level  is  65 
figure( ’Position’ , [750,  700,  300  ,300]) 
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image (Xnew3) 
axis  off 
colormap(gray) 

plt_name=[’ recons’  int2str (Which)] 

eval( [’print  -deps  /home/hawkeye6/96d/96d/kyoimis/PR/VQ/images/’  plt_name] ) 

I - 

tot _time=etime (clock, tm) 


93 


Bibliography 


1.  Backer,  E.  and  A.  K.  Jain.  “A  Clustering  Performance  Measure  Based  on  Fuzzy 
Set  Decomposition,”  IEEE  Trans.  Pattern  Anal.  Machnine  Intel!.,  PAMI-3 {l):66-74 
(1981). 

2.  Ball,  G.  H.  and  D.  G.  Hall.  “Some  fundamental  Concepts  and  Synthesis  Procedures  for 
Pattern  Recognition  Preprocessors,”  In  Proc.  Int.  Conf.  Microwaves,  Circuit  Theory, 
Information  Theory,  281-297  (September  1964).  Tokyo,  Japan. 

3.  Baum,  L.,  et  al.  “A  maximization  Technique  Occuring  in  the  Statistical  Analysis  of 
Probailistic  Functions  of  Markov  Chains,”  Ann.  Math  Statistics,  164-171  (1970). 

4.  Bezdek,  James.  Pattern  Recognition  with  Fuzzy  Objective  Function  Algorithm. 
Plenum  Press,  New  York,  1981. 

5.  Bezdek,  James,  et  al.  “Some  New  Competitive  Learning  Schemes,”  SPIE,  2492 
(1995). 

6.  Bezdek,  James  C.  Fuzzy  Mathematics  in  Pattern  Classification.  PhD  dissertation, 
Cornell  Univ.,  Ithaca,  NY,  1973. 

7.  Bezdek,  James  C.  Self- Organization  and  Clustering  Algorithms.  Technical  Report, 
Defense  Technical  Information  Center  (DTIC)  N91-21783,  1991. 

8.  Carpenter,  G.  and  S.  Grossberg.  “Adaptive  Resonance  Theory:  Stable  Self  Organiza¬ 
tion  of  Neural  Recognition  Code  in  Response  to  Arbitrary  Lists  of  Input  Patterns,” 
In  Proc.  8th  Annual  Conf.  Cognitive  Science,  45-62  (1986). 

9.  Darken,  Christian  and  John  Moody.  Fast  Adaptive  k-means  Clustering:  Some  Imper- 
ical  Results.  Technical  Report,  AFOSR  Grant  89-0478,  1989. 

10.  DeSieno,  D.  “Adding  Conscience  to  Competitive  Learning,”  IEEE  Proceedings  Int. 
Conf.  in  Neural  Networks  (ICNN-88),  117-124  (1988).  San  Deigo,  CA. 

11.  DeSimio,  Martin,  “AFIT/EENG  817  Class  Notes  and  Personal  Interviews.”  The  Air 
Force  Institute  of  Technology,  1995. 

12.  Duda,  Richard  O.  and  Peter  E.  Hart.  Pattern  Classification  and  Scene  Analysis.  John 
Wiley  and  Sons,  1973. 

13.  Dunn,  J.  C.  “A  Fuzzy  Relative  of  the  ISODATA  Process,”  Cybernetics,  32-57  (March 
1974). 

14.  Gath,  1.  and  A.  B.  Geva.  “Unsupervised  Optimal  fuzzy  Clustering,”  IEEE  Trans,  on 
Pattern  Analysis  and  Machine  Intelligence,  PAMI-ll{7):77i-781  (July  1989). 

15.  Gersho,  Allen  and  Robert  M.  Gray.  Vector  Quantization  and  Signal  Compression. 
Boston,  MA:  Kluwer,  1992. 

16.  Gonzalez,  Rafael  C.  and  Paul  Wintz.  Digital  Image  Processing  (2nd  Edition).  Addison 
Wesley,  1987. 

17.  Gray,  Robert  M.  “Vector  Quantization,”  IEEE  ASSP  Magazine,  1:4-29  (1989). 


94 


18.  Jain,  A.  K.  and  R.  C.  Dubes.  Algorithms  for  Clustering  Data.  Englewood  Cliffs,  NJ: 
Prentice-Hall,  1988. 

19.  Jain,  Anil  K.  and  Jianchang  Mao.  Computational  Intelligence  Imitating  Life.  IEEE 
Press  Marketing,  1994. 

20.  Jolion,  J.  M.,  et  al.  “Robust  Clustering  with  Applications  in  Computer  Vision,”  IEEE 
Trans.  Pattern  Anal.  Machine  Intel!,  PAMI- 1 3 (8):791-802  (1991). 

21.  Katsavounidis,  loannis,  et  al.  “A  New  Initialization  Technique  for  Generalized  Lloyd 
Iteration,”  IEEE  Signal  Processing  Letters,  1  (10):144-146  (October  1994). 

22.  Keller,  James  M.,  et  al.  “A  Fuzzy  K-Nearest  Neighbor  Algorithm,”  IEEE  Transactions 
on  Systems,  Man,  and  Cybernetics,  SMC-15 (A):580-585  (July/August  1985). 

23.  Kohonen,  Tuevo  T.  Self- Organization  and  Associative  Memory.  Springer- Verlag, 
Berlin,  1989. 

24.  Linde,  Y.,  et  al.  “An  Algorithm  for  Vector  Quantizer  Design,”  7EEE  Trans.  Commun., 
COM-28:84-95  (1980). 

25.  Mao,  Jianchang  and  Anil  K.  Jain.  “A  Self-Organizing  Network  for  Hyper  ellipsoidal 
Clustering,”  IEEE  Transactions  on  Neural  Networks,  7(l):16-29  (January  1996). 

26.  McQueen,  J.  B.  “Some  Methods  of  Classification  and  Analysis  of  Multivariate  Ob¬ 
servations,”  In  Proc.  5th  Berkeley  Symp.  Math.  Statis.  Probability,  281-297  (1967). 

27.  Patrick,  Edward  A.  and  Leon  Y.  L.  Shen.  “Interactive  Use  of  Problem  knowledge 
for  Clustering  and  Decision  Making,”  IEEE  Tran,  on  Computers,  216-222  (February 
1971). 

28.  Reynolds,  Douglas  A.  A  Gaussian  Mixture  Modeling  Approach  to  Text- Independent 
Speaker  Identification.  PhD  dissertation,  Georgia  Institute  of  Technology,  August 
1992. 

29.  Rogers,  Steven  K.,  et  al.  An  Introduction  to  Biological  and  Artificial  Neural  Networks. 
Bellingham,  Washington:  SPIE  Optical  Engineering  Press,  1991. 

30.  Rubner,  J.  and  P.  Tavan.  “A  Self  Organizing  Network  for  Principal  Component 
Analysis,”  Europhysics  letters,  10:693-698  (1989). 

31.  Ruck,  Dennis  W.  and  Steven  Rogers.  “The  Multilayer  Perceptron  as  an  Approxi¬ 
mation  to  a  Bayes  Optimal  Discriminant  Function,”  IEEE  Transactions  on  Neural 
Networks,  l(2):296-298  (December  1990). 

32.  Ruck,  Dennis  W.,  et  al.  “The  Multilayer  Perceptron  as  an  Approximation  to  a  Bayes 
Optimal  Discriminant  Function,”  IEEE  Transactions  on  Neural  Networks  (1990). 

33.  Tou,  Julius  T.  and  Rafael  C.  Gonzalez.  Pattern  Recognition  Principles.  Reading, 
MA:  Addison- Wesley  Publishing  Company,  1974. 

34.  Windham,  M.  P.  “Cluster  Validity  of  the  Fuzzy  c-means  Clustering  Algorithm,”  IEEE 
Trans.  Pattern  Analysis  and  Machine  Intelligence,  PAMI-4  {A):357-363  (1982). 

35.  Younis,  Khaled  S.,  et  al.  “Vector  Quantization  Based  on  Dynamic  Adjustment  of 
Mahalanobis  Distance,”  IEEE  National  Aerospace  and  Electronics  Conference  (NAE- 
CON),  .2:858-862  (May  1996). 


95 


36.  Zadeh,  Lotfi  A.  “Fuzzy  Sets,”  Information  and  Control,  5:338-353  (1965). 


96 


Vita 


Captain  Khaled  S.  Younis  iiitfHHPHHHHHpVIIIMMMI  He  graduated  Firet 
in  the  1M6  high  sdiool  claae  at  Dirar  Ibnel  Azwar  School  in  Amnum,  Jordan.  In  1990, 
he  graduated  from  Mu’tah  University  with  a  Bachelor  of  Science  Degree  in  Electrical 
Engineering  and  he  was  aw^ed  a  Royal  prize  granted  for  the  first  of  the  Engineering 
graduates.  Upon  graduation,  Capt.  Younis  received  his  commission  as  a  Second  Lieutenant 
in  the  Royal  Jordanian  Air  Force  (RJAF).  In  1994,  he  was  selected  by  the  RJAF  to  earn  his 
Master  of  Science  Degree  in  Electrical  Engineering  at  the  Air  Force  Institute  of  Technology 
(AFIT)  at  Wright-Patterson  AFB,  Ohio.  Capt  Younis  was  awarded  a  scholarship  by  the 
Dayton  Area  Gradtiate  Studies  Institute  (DAGSI)  to  pursue  a  Doctorate  of  Philosophy 
Degree  in  Electrical  Engineering  after  completion  of  his  Master’s  program  at  AFIT. 


97 


REPORT  DOCUMENTATION  PAGE 


Form  Approved 
0MB  No.  0704-0188 


X 


Public  reporting  burden  for  this  collection  of  information  is  estimated  to  average  i  hour  per  response,  including  the  time  for  reviewing  instructions,  searching  existing  data  sources, 
gathering  and  maintaining  the  data  needed,  and  completing  and  reviewing  the  colleaion  of  information.  Send  comments  regarding  this  burden  estimate  or  any  other  aspea  of  this 
collection  of  information,  including  suggestions  for  r^ucing  this  burden,  to  Washington  Headquarters  Services.  Oireaorate  tor  Information  Operations  and  Reports,  1215  Jefferson 
Davis  Highway,  Suite  1204,  Arlington,  VA  22202-4302.  and  to  the  Office  of  Management  and  Budget.  Paperwork  Reduction  Project  {0704-0188).  Washington.  DC  20503. 

1.  AGENCY  USE  ONLY  (Leave  blank) 

^  REPORT  DATE 

December  1996 

3.  REPORT  TYPE  AND  DATES  COVERED 

Master’s  Thesis 

4.  TITLE  AND  SUBTITLE 

WEIGHTED  MAHALANOBIS  DISTANCE  FOR 

HYPER-ELLIPSOIDAL  CLUSTERING 

5.  FUNDING  NUMBERS 

6.  AUTHOR(S) 

Khaled  S.  Younis 

Captain,  RJAF 

7.  PERFORMING  ORGANIZATION  NAME(S)  ANO  ADORESS(ES) 

Air  Force  Institute  of  Technology 

WPAFB  OH  45433-6583 

8.  PERFORMING  ORGANIZATION 

REPORT  NUMBER 

AFIT/GE/ENG/96D-22 

9.  SPONSORING /MONITORING  AGENCY  NAME(S)  AND  ADDRESS(ES) 

Major  Thomas  Burns,  WL/AARA 

2010  fifth  Street,  Bldg.  23 

WPAFB,  OH  45433 

10.  SPONSORING /MONITORING 

AGENCY  REPORT  NUMBER 

11.  SUPPLEMENTARY  NOTES 

12a.  DISTRIBUTION /AVAILABILITY  STATEMENT 

Approved  for  public  release;  Distribution  Unlimited 

12b.  DISTRIBUTION  CODE 

13.  ABSTRACT  (Maximum  200  words) 

Abstract 

Cluster  analysis  is  widely  used  in  many  applications,  ranging  from  image  and  speech  coding  to  pattern  recog¬ 
nition.  A  new  method  that  uses  the  weighted  Mahalanobis  distance  (WMD)  via  the  covariance  matrix  of  the 
individual  clusters  as  the  basis  for  grouping  is  presented  in  this  thesis.  In  this  algorithm,  the  Mahalanobis  dis¬ 
tance  is  used  as  a  measure  of  similarity  between  the  samples  in  each  cluster.  This  thesis  discusses  some  difficulties 
associated  with  using  the  Mahalanobis  distance  in  clustering.  The  proposed  method  provides  solutions  to  these 
problems.  The  new  algorithm  is  an  approximation  to  the  well-known  expectation  maximization  (EM)  procedure 
used  to  find  the  majdmum  likelihood  estimates  in  a  Gaussian  mixture  model.  Unlike  the  EM  procedure,  WMD 
eliminates  the  requirement  of  having  initial  parameters  such  as  the  cluster  means  and  variances  as  it  starts  from 
the  raw  data  set.  Properties  of  the  new  clustering  method  are  presented  by  examining  the  clustering  quality  for 
codebooks  designed  with  the  proposed  method  and  competing  methods  on  a  variety  of  data  sets.  The  compet¬ 
ing  methods  are  the  Linde-Buzo-Gray  (LBG)  algorithm  and  the  Fuzzy  c-means  (FCM)  algorithm,  both  of  them 
use  the  Euclidean  distance.  The  neural  network  for  hyperellipsoidal  clustering  (HEC)  that  uses  the  Mahalnobis 
distance  is  also  studied  and  compared  to  the  WMD  method  and  the  other  techniques  as  well.  The  new  method 
provides  better  results  than  the  competing  methods.  Thus,  this  method  becomes  another  useful  tool  for  use  in 

_ clii.stprincr. _ 

14.  SUBJECT  TERMS 

Clustering,  Unsupervised  Learning,  Mahalanobis  Distance,  Neural  Networks,  Pat- 

15.  NUMBER  OF  PAGES 

109 

tern  Recognition 

16.  PRICE  CODE 

17.  SECURITY  CLASSIFICATION 

OF  REPORT 

UNCLASSIFIED 

18.  SECURITY  CLASSIFICATION 

OF  THIS  PAGE 

UNCLASSIFIED 

19.  SECURITY  CLASSIFICATION 
OF  ABSTRACT 

UNCLASSIFIED 

20.  LIMITATION  OF  ABSTRACT 

UL 

NSN  7540-01-280-5500 


Standard  Form  298  (Rev.  2-89) 

Prescribed  by  ANSI  Std.  Z39-18 
298-102 


